* Question about Experimental of Filesystem DAX.
@ 2018-05-31  2:27 Yasunori Goto
  2018-05-31 15:07   ` Ross Zwisler
  0 siblings, 1 reply; 31+ messages in thread
From: Yasunori Goto @ 2018-05-31  2:27 UTC (permalink / raw)
  To: NVDIMM-ML

Hello,


I would like to ask about the EXPERIMENTAL message printed for Filesystem DAX.
--------------------------------------------------------
DAX enabled. Warning: EXPERIMENTAL, use at your own risk
--------------------------------------------------------

AFAIK, the last remaining issue for Filesystem DAX is the metadata update
problem, and it is (or will be?) solved by the great work on MAP_SYNC and
the "fix dma vs truncate/hole-punch" patch set.
So I suppose that the EXPERIMENTAL message can be removed,
but I'm not sure.
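
For readers unfamiliar with MAP_SYNC, here is a minimal sketch of how an
application might request it. The helper names map_pmem_file() and
store_greeting() are hypothetical, and the fallback flag values are the
asm-generic ones (an assumption that holds on x86). The mapping is requested
with MAP_SHARED_VALIDATE so that the kernel rejects MAP_SYNC when it cannot
honor it, and the sketch then falls back to a plain shared mapping plus msync().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; asm-generic values (assumed x86). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Map len bytes of fd read/write. MAP_SYNC is tried first; if the kernel
 * or filesystem refuses it (e.g. the file is not on DAX), fall back to a
 * plain MAP_SHARED mapping. *synced reports which one we got.
 * Returns NULL if both attempts fail.
 */
void *map_pmem_file(int fd, size_t len, int *synced)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *synced = 1;
        return p;
    }
    *synced = 0;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical usage: create path, map it, store a string, unmap. */
int store_greeting(const char *path)
{
    const char msg[] = "hello";
    const size_t len = 4096;
    int synced = 0;
    int rc = 0;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    char *p = map_pmem_file(fd, len, &synced);
    close(fd); /* the mapping stays valid after close() */
    if (p == NULL)
        return -1;

    memcpy(p, msg, sizeof msg);
    /*
     * Without MAP_SYNC we need msync() for durability. With MAP_SYNC the
     * caller would still have to flush CPU caches; that is omitted here.
     */
    if (!synced && msync(p, len, MS_SYNC) != 0)
        rc = -1;
    munmap(p, len);
    return rc;
}
```

On a non-DAX filesystem this takes the fallback path, so it can be tried
anywhere, e.g. store_greeting("/tmp/example.dat").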

Is that possible?
If not, are there any other outstanding issues in Filesystem DAX?

If this is a silly question, sorry for the noise.

Thanks,
---
Yasunori Goto



_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
@ 2018-05-31 15:07   ` Ross Zwisler
  0 siblings, 0 replies; 31+ messages in thread
From: Ross Zwisler @ 2018-05-31 15:07 UTC (permalink / raw)
  To: Yasunori Goto, Jan Kara, Darrick J. Wong; +Cc: linux-xfs, linux-ext4, NVDIMM-ML

On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
> Hello,
> 
> 
> I would like to know about the Experimental message of Filesystem DAX.
> --------------------------------------------------------
> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> --------------------------------------------------------
> 
> AFAIK, the final issue of Filesystem DAX is metadata update problem, 
> and it is(will be?) solved by great effort of MAP_SYNC and
> "fix dma vs truncate/hole-punch" patch set. 
> So, I suppose that the Experimental message can be removed,
> but I'm not sure.
> 
> Is it possible?
> Otherwise, are there any other issues in Filesystem DAX yet?
> 
> If this is silly question, sorry for noise....
> 
> Thanks,
> ---
> Yasunori Goto

Adding in the XFS and ext4 developers, as it's really their call when to
remove this notice.

We've talked about this off and on for a long while, but IMHO we should remove
the EXPERIMENTAL warning.  The last few things that we had on our TODO list
before removing it were:

1) Get consistent handling of the DAX mount option.  We currently have this,
as both filesystems will behave the same and fall back and remove the DAX
mount option if it is unsupported by the block device, etc.

2) Get consistent handling of the DAX inode option.  We currently have this,
as all DAX behavior now happens through the mount option.  If/when we
re-enable the per-inode DAX flag we should do it consistently for all DAX
enabled filesystems.

3) Make DAX work with other XFS features like reflink, etc.  This one isn't
done, but we at least disallow DAX with XFS features like reflink where it
could be an issue.  Darrick, do you still feel like we need to get these
working together to remove EXPERIMENTAL, or are you happy enough that we're
keeping them separated and that we're keeping user data safe?

Jan and the other ext4 guys, do you have any additional things you need done
before removing the EXPERIMENTAL warning from ext4 + DAX?
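
Since item 1 says the filesystem falls back and drops the DAX mount option
when the block device cannot support it, userspace can check whether DAX
actually took effect by looking at the mount table after mounting. A rough
sketch, assuming the active option shows up as "dax" in a /proc/mounts-style
file; mount_has_dax() is a hypothetical helper name, while setmntent(),
getmntent() and hasmntopt() are the usual glibc calls:

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up mountpoint in a mount table such as /proc/mounts.
 * Returns 1 if it is mounted with the "dax" option, 0 if it is mounted
 * without it, and -1 if the table cannot be read or has no such entry.
 */
int mount_has_dax(const char *mounts_file, const char *mountpoint)
{
    FILE *f = setmntent(mounts_file, "r");
    struct mntent *m;
    int result = -1;

    if (f == NULL)
        return -1;
    while ((m = getmntent(f)) != NULL) {
        /* Later entries shadow earlier ones, so the last match wins. */
        if (strcmp(m->mnt_dir, mountpoint) == 0)
            result = hasmntopt(m, "dax") != NULL;
    }
    endmntent(f);
    return result;
}
```

A caller would typically use mount_has_dax("/proc/mounts", "/mnt/pmem") right
after mounting with -o dax and treat a 0 result as "the filesystem fell back
to non-DAX".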

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
@ 2018-05-31 16:29     ` Dan Williams
  0 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2018-05-31 16:29 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, NVDIMM-ML, Darrick J. Wong, linux-xfs, Yasunori Goto,
	linux-ext4

On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
>> Hello,
>>
>>
>> I would like to know about the Experimental message of Filesystem DAX.
>> --------------------------------------------------------
>> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
>> --------------------------------------------------------
>>
>> AFAIK, the final issue of Filesystem DAX is metadata update problem,
>> and it is(will be?) solved by great effort of MAP_SYNC and
>> "fix dma vs truncate/hole-punch" patch set.
>> So, I suppose that the Experimental message can be removed,
>> but I'm not sure.
>>
>> Is it possible?
>> Otherwise, are there any other issues in Filesystem DAX yet?
>>
>> If this is silly question, sorry for noise....
>>
>> Thanks,
>> ---
>> Yasunori Goto
>
> Adding in the XFS and ext4 developers, as it's really their call when to
> remove this notice.
>
> We've talked about this off and on for a long while, but IMHO we should remove
> the EXPERIMENTAL warning.  The last few things that we had on our TODO list
> before this was removed were:
>
> 1) Get consistent handling of the DAX mount option.  We currently have this,
> as both filesystems will behave the same and fall back and remove the DAX
> mount option if it is unsupported by the block device, etc.
>
> 2) Get consistent handling of the DAX inode option.  We currently have this,
> as all DAX behavior now happens through the mount option.  If/when we
> re-enable the per-inode DAX flag we should do it consistently for all DAX
> enabled filesystems.
>
> 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
> done, but we at least disallow DAX with XFS features like reflink where it
> could be an issue.  Darrick, do you still feel like we need to get these
> working together to remove EXPERIMENTAL, or are you happy enough that we're
> keeping them separated and that we're keeping user data safe?
>
> Jan and the other ext4 guys, do you have any additional things you need done
> before removing the EXPERIMENTAL warning from ext4 + DAX?

The ones on my list are:

1/ Get proper support for recovering userspace consumed poison in DAX
mappings (may not make 4.18)

2/ The DAX-DMA vs Truncate fix (queued for 4.18).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
@ 2018-05-31 17:46       ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2018-05-31 17:46 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jan Kara, NVDIMM-ML, linux-xfs, Yasunori Goto, linux-ext4

On Thu, May 31, 2018 at 09:29:15AM -0700, Dan Williams wrote:
> On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
> >> Hello,
> >>
> >>
> >> I would like to know about the Experimental message of Filesystem DAX.
> >> --------------------------------------------------------
> >> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> >> --------------------------------------------------------
> >>
> >> AFAIK, the final issue of Filesystem DAX is metadata update problem,
> >> and it is(will be?) solved by great effort of MAP_SYNC and
> >> "fix dma vs truncate/hole-punch" patch set.
> >> So, I suppose that the Experimental message can be removed,
> >> but I'm not sure.
> >>
> >> Is it possible?
> >> Otherwise, are there any other issues in Filesystem DAX yet?
> >>
> >> If this is silly question, sorry for noise....
> >>
> >> Thanks,
> >> ---
> >> Yasunori Goto
> >
> > Adding in the XFS and ext4 developers, as it's really their call when to
> > remove this notice.
> >
> > We've talked about this off and on for a long while, but IMHO we should remove
> > the EXPERIMENTAL warning.  The last few things that we had on our TODO list
> > before this was removed were:
> >
> > 1) Get consistent handling of the DAX mount option.  We currently have this,
> > as both filesystems will behave the same and fall back and remove the DAX
> > mount option if it is unsupported by the block device, etc.

<nod>

As an aside, I wonder if Christoph's musings about "just have the kernel
determine the appropriate dax/non-dax setting from the acpi tables and
skip the inode flag entirely" ever got resolved?

> > 2) Get consistent handling of the DAX inode option.  We currently have this,
> > as all DAX behavior now happens through the mount option.  If/when we
> > re-enable the per-inode DAX flag we should do it consistently for all DAX
> > enabled filesystems.

The behavior of the inode flag isn't all that consistent.  ext4 doesn't
support it at all.  On XFS, you can set or clear FS_XFLAG_DAX on a
directory which will propagate the setting to any files created in that
directory.

However, if you set or clear it on a file we update the on-disk inode
but we can't change the in-core state flag (S_DAX) until the next
in-core inode instantiation.  It's weird that users can change the flag
but the intended behavior changes won't happen until some ... time ...
in the future??
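
To make the flag handling above concrete, here is a small sketch of toggling
FS_XFLAG_DAX from userspace through the FS_IOC_FSGETXATTR and
FS_IOC_FSSETXATTR ioctls from <linux/fs.h>. The helper names with_dax_flag()
and set_dax_xflag() are hypothetical. As described above, on a regular file
this only updates the on-disk flag; S_DAX does not change until the inode is
next instantiated in core.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Return xflags with the DAX bit set (enable != 0) or cleared. */
unsigned int with_dax_flag(unsigned int xflags, int enable)
{
    if (enable)
        return xflags | FS_XFLAG_DAX;
    return xflags & ~(unsigned int)FS_XFLAG_DAX;
}

/*
 * Set or clear FS_XFLAG_DAX on a file or directory.
 * Returns 0 on success, -1 on failure with errno set.
 */
int set_dax_xflag(const char *path, int enable)
{
    struct fsxattr fsx;
    int saved_errno;
    int rc = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    /* Read-modify-write so the other xflags are preserved. */
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) == 0) {
        fsx.fsx_xflags = with_dax_flag(fsx.fsx_xflags, enable);
        rc = ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
    }
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return rc;
}
```

Calling set_dax_xflag() on a directory is the case where the setting is
inherited by files created there afterwards; filesystems that do not support
the flag are expected to fail the ioctl.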

> > 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
> > done, but we at least disallow DAX with XFS features like reflink where it
> > could be an issue.  Darrick, do you still feel like we need to get these
> > working together to remove EXPERIMENTAL, or are you happy enough that we're
> > keeping them separated and that we're keeping user data safe?

Yes, reflink and dax still need to work together.  I've not heard any
good arguments for why page sharing + copy on write are fundamentally
incompatible with the dax model, or why dax users will never, ever
require reflink.

The recent thread between Jan and Dan makes me wonder if making mappings
share struct pages is going to be a nightmare to add to the mm code,
though...

Also: ideally XFS would also be able to consume poison event
notifications from the pmem so that it can try to deal with metadata
loss, but that's probably a separate effort.

--D

> > Jan and the other ext4 guys, do you have any additional things you need done
> > before removing the EXPERIMENTAL warning from ext4 + DAX?
> 
> The one's on my list are:
> 
> 1/ Get proper support for recovering userspace consumed poison in DAX
> mappings (may not make 4.18)
> 
> 2/ The DAX-DMA vs Truncate fix (queued for 4.18).
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
  2018-05-31 17:46       ` Darrick J. Wong
  (?)
@ 2018-05-31 18:26         ` Dan Williams
  -1 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2018-05-31 18:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Jan Kara, NVDIMM-ML, linux-xfs, Yasunori Goto, linux-ext4

On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Thu, May 31, 2018 at 09:29:15AM -0700, Dan Williams wrote:
>> On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
>> >> Hello,
>> >>
>> >>
>> >> I would like to know about the Experimental message of Filesystem DAX.
>> >> --------------------------------------------------------
>> >> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
>> >> --------------------------------------------------------
>> >>
>> >> AFAIK, the final issue of Filesystem DAX is metadata update problem,
>> >> and it is(will be?) solved by great effort of MAP_SYNC and
>> >> "fix dma vs truncate/hole-punch" patch set.
>> >> So, I suppose that the Experimental message can be removed,
>> >> but I'm not sure.
>> >>
>> >> Is it possible?
>> >> Otherwise, are there any other issues in Filesystem DAX yet?
>> >>
>> >> If this is silly question, sorry for noise....
>> >>
>> >> Thanks,
>> >> ---
>> >> Yasunori Goto
>> >
>> > Adding in the XFS and ext4 developers, as it's really their call when to
>> > remove this notice.
>> >
>> > We've talked about this off and on for a long while, but IMHO we should remove
>> > the EXPERIMENTAL warning.  The last few things that we had on our TODO list
>> > before this was removed were:
>> >
>> > 1) Get consistent handling of the DAX mount option.  We currently have this,
>> > as both filesystems will behave the same and fall back and remove the DAX
>> > mount option if it is unsupported by the block device, etc.
>
> <nod>
>
> As an aside, I wonder if Christoph's musings about "just have the kernel
> determine the appropriate dax/non-dax setting from the acpi tables and
> skip the inode flag entirely" ever got resolved?
>
>> > 2) Get consistent handling of the DAX inode option.  We currently have this,
>> > as all DAX behavior now happens through the mount option.  If/when we
>> > re-enable the per-inode DAX flag we should do it consistently for all DAX
>> > enabled filesystems.
>
> The behavior of the inode flag isn't all that consistent.  ext4 doesn't
> support it at all.  On XFS, you can set or clear FS_XFLAG_DAX on a
> directory which will propagate the setting to any files created in that
> directory.
>
> However, if you set or clear it on a file we update the on-disk inode
> but we can't change the in-core state flag (S_DAX) until the next
> in-core inode instantiation.  It's weird that users can change the flag
> but the intended behavior changes won't happen until some ... time ...
> in the future??
>
>> > 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
>> > done, but we at least disallow DAX with XFS features like reflink where it
>> > could be an issue.  Darrick, do you still feel like we need to get these
>> > working together to remove EXPERIMENTAL, or are you happy enough that we're
>> > keeping them separated and that we're keeping user data safe?
>
> Yes, reflink and dax still need to work together.  I've not heard any
> good arguments for why page sharing + copy on write are fundamentally
> incompatible with the dax model, or why dax users will never, ever
> require reflink.

Right, but that's separate from DAX being scream-in-your-face
"EXPERIMENTAL!". It's just an additional feature that can be added on
once all the normal expectations of a userspace mapping work. I think
reliable rmap is the last of those requirements.

> The recent thread between Jan and Dan make me wonder if making mappings
> share struct pages is going to be a nightmare to add to the mm code,
> though...

It's going to be a bit messy because a singular page->mapping
association is fundamentally incompatible with DAX. Perhaps a linked
list of mapping "siblings"?
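
For illustration only, here is a minimal userspace sketch of what a
linked list of mapping "siblings" could look like in place of a single
page->mapping pointer. Every name below (page_model, sibling_model,
page_add_sibling, and so on) is hypothetical and is not a kernel API;
it only models the data structure and ignores locking and lifetime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an address_space; not the kernel type. */
struct mapping_model {
	int id;
};

/* One entry per mapping that references the same physical page. */
struct sibling_model {
	struct mapping_model *mapping;
	unsigned long index;		/* page offset within that mapping */
	struct sibling_model *next;
};

/* Model of a page that can belong to several mappings at once. */
struct page_model {
	struct sibling_model *siblings;	/* replaces a single ->mapping */
};

/* Link a caller-provided sibling entry at the head of the page's list. */
static void page_add_sibling(struct page_model *page,
			     struct sibling_model *entry,
			     struct mapping_model *mapping,
			     unsigned long index)
{
	assert(page != NULL && entry != NULL && mapping != NULL);
	entry->mapping = mapping;
	entry->index = index;
	entry->next = page->siblings;
	page->siblings = entry;
}

/* Count how many mappings currently reference the page. */
static size_t page_count_siblings(const struct page_model *page)
{
	size_t n = 0;
	const struct sibling_model *s;

	for (s = page->siblings; s != NULL; s = s->next)
		n++;
	return n;
}

/* Reverse lookup: does the given mapping reference this page? */
static int page_mapped_by(const struct page_model *page,
			  const struct mapping_model *mapping)
{
	const struct sibling_model *s;

	for (s = page->siblings; s != NULL; s = s->next)
		if (s->mapping == mapping)
			return 1;
	return 0;
}
```

A reverse-map walk in this model visits every sibling entry, so its
cost grows with the number of files sharing the page.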

> Also: ideally XFS would also be able to consume poison event
> notifications from the pmem so that it can try to deal with metadata
> loss, but that's probably a separate effort.

Right, not a gating item for declaring DAX ready for prime time.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
@ 2018-05-31 20:25           ` Ross Zwisler
  0 siblings, 0 replies; 31+ messages in thread
From: Ross Zwisler @ 2018-05-31 20:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, NVDIMM-ML, linux-xfs, Yasunori Goto,
	linux-ext4

On Thu, May 31, 2018 at 11:26:43AM -0700, Dan Williams wrote:
> On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Thu, May 31, 2018 at 09:29:15AM -0700, Dan Williams wrote:
> >> On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
> >> >> Hello,
> >> >>
> >> >>
> >> >> I would like to know about the Experimental message of Filesystem DAX.
> >> >> --------------------------------------------------------
> >> >> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> >> >> --------------------------------------------------------
> >> >>
> >> >> AFAIK, the final issue of Filesystem DAX is metadata update problem,
> >> >> and it is(will be?) solved by great effort of MAP_SYNC and
> >> >> "fix dma vs truncate/hole-punch" patch set.
> >> >> So, I suppose that the Experimental message can be removed,
> >> >> but I'm not sure.
> >> >>
> >> >> Is it possible?
> >> >> Otherwise, are there any other issues in Filesystem DAX yet?
> >> >>
> >> >> If this is silly question, sorry for noise....
> >> >>
> >> >> Thanks,
> >> >> ---
> >> >> Yasunori Goto
> >> >
> >> > Adding in the XFS and ext4 developers, as it's really their call when to
> >> > remove this notice.
> >> >
> >> > We've talked about this off and on for a long while, but IMHO we should remove
> >> > the EXPERIMENTAL warning.  The last few things that we had on our TODO list
> >> > before this was removed were:
> >> >
> >> > 1) Get consistent handling of the DAX mount option.  We currently have this,
> >> > as both filesystems will behave the same and fall back and remove the DAX
> >> > mount option if it is unsupported by the block device, etc.
> >
> > <nod>
> >
> > As an aside, I wonder if Christoph's musings about "just have the kernel
> > determine the appropriate dax/non-dax setting from the acpi tables and
> > skip the inode flag entirely" ever got resolved?
> >
> >> > 2) Get consistent handling of the DAX inode option.  We currently have this,
> >> > as all DAX behavior now happens through the mount option.  If/when we
> >> > re-enable the per-inode DAX flag we should do it consistently for all DAX
> >> > enabled filesystems.
> >
> > The behavior of the inode flag isn't all that consistent.  ext4 doesn't
> > support it at all.  On XFS, you can set or clear FS_XFLAG_DAX on a
> > directory which will propagate the setting to any files created in that
> > directory.
> >
> > However, if you set or clear it on a file we update the on-disk inode
> > but we can't change the in-core state flag (S_DAX) until the next
> > in-core inode instantiation.  It's weird that users can change the flag
> > but the intended behavior changes won't happen until some ... time ...
> > in the future??
> >
> >> > 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
> >> > done, but we at least disallow DAX with XFS features like reflink where it
> >> > could be an issue.  Darrick, do you still feel like we need to get these
> >> > working together to remove EXPERIMENTAL, or are you happy enough that we're
> >> > keeping them separated and that we're keeping user data safe?
> >
> > Yes, reflink and dax still need to work together.  I've not heard any
> > good arguments for why page sharing + copy on write are fundamentally
> > incompatible with the dax model, or why dax users will never, ever
> > require reflink.
> 
> Right, but that's separate from DAX being scream in your face
> "EXPERIMENTAL!". It's just an additional feature that can be added on
> once all the normal expectations of a userspace mapping work. I think
> reliable rmap is the last of those requirements.
> 
> > The recent thread between Jan and Dan make me wonder if making mappings
> > share struct pages is going to be a nightmare to add to the mm code,
> > though...
> 
> It's going to be a bit messy because a singular page->mapping
> association is fundamentally incompatible with DAX. Perhaps a linked
> list of mapping "siblings"?
> 
> > Also: ideally XFS would also be able to consume poison event
> > notifications from the pmem so that it can try to deal with metadata
> > loss, but that's probably a separate effort.
> 
> Right, not a gating item for declaring DAX ready for prime time.

Yep, I think that the very loud EXPERIMENTAL message is essentially telling
users "your data is at risk if you use this".  I totally agree that we still
have lots of work to do.  However, I don't think that these feature
enhancements should gate removal of the EXPERIMENTAL notice.  IMHO that
should only exist as long as we have issues that we know could corrupt data,
crash the box, etc.  As far as I know those are basically the 2 items on Dan's
list from a few mails ago (poison recovery & DMA vs truncate).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
@ 2018-05-31 23:05           ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2018-05-31 23:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, NVDIMM-ML, linux-xfs, Yasunori Goto,
	linux-ext4

On Thu, May 31, 2018 at 11:26:43AM -0700, Dan Williams wrote:
> On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Thu, May 31, 2018 at 09:29:15AM -0700, Dan Williams wrote:
> >> On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
> >> >> Hello,
> >> >>
> >> >>
> >> >> I would like to know about the Experimental message of Filesystem DAX.
> >> >> --------------------------------------------------------
> >> >> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> >> >> --------------------------------------------------------
> >> >>
> >> >> AFAIK, the final issue of Filesystem DAX is metadata update problem,
> >> >> and it is(will be?) solved by great effort of MAP_SYNC and
> >> >> "fix dma vs truncate/hole-punch" patch set.
> >> >> So, I suppose that the Experimental message can be removed,
> >> >> but I'm not sure.
> >> >>
> >> >> Is it possible?
> >> >> Otherwise, are there any other issues in Filesystem DAX yet?
> >> >>
> >> >> If this is silly question, sorry for noise....
> >> >>
> >> >> Thanks,
> >> >> ---
> >> >> Yasunori Goto
> >> >
> >> > Adding in the XFS and ext4 developers, as it's really their call when to
> >> > remove this notice.
> >> >
> >> > We've talked about this off and on for a long while, but IMHO we should remove
> >> > the EXPERIMENTAL warning.  The last few things that we had on our TODO list
> >> > before this was removed were:
> >> >
> >> > 1) Get consistent handling of the DAX mount option.  We currently have this,
> >> > as both filesystems will behave the same and fall back and remove the DAX
> >> > mount option if it is unsupported by the block device, etc.
> >
> > <nod>
> >
> > As an aside, I wonder if Christoph's musings about "just have the kernel
> > determine the appropriate dax/non-dax setting from the acpi tables and
> > skip the inode flag entirely" ever got resolved?
> >
> >> > 2) Get consistent handling of the DAX inode option.  We currently have this,
> >> > as all DAX behavior now happens through the mount option.  If/when we
> >> > re-enable the per-inode DAX flag we should do it consistently for all DAX
> >> > enabled filesystems.
> >
> > The behavior of the inode flag isn't all that consistent.  ext4 doesn't
> > support it at all.  On XFS, you can set or clear FS_XFLAG_DAX on a
> > directory which will propagate the setting to any files created in that
> > directory.
> >
> > However, if you set or clear it on a file we update the on-disk inode
> > but we can't change the in-core state flag (S_DAX) until the next
> > in-core inode instantiation.  It's weird that users can change the flag
> > but the intended behavior changes won't happen until some ... time ...
> > in the future??
> >
> >> > 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
> >> > done, but we at least disallow DAX with XFS features like reflink where it
> >> > could be an issue.  Darrick, do you still feel like we need to get these
> >> > working together to remove EXPERIMENTAL, or are you happy enough that we're
> >> > keeping them separated and that we're keeping user data safe?
> >
> > Yes, reflink and dax still need to work together.  I've not heard any
> > good arguments for why page sharing + copy on write are fundamentally
> > incompatible with the dax model, or why dax users will never, ever
> > require reflink.
> 
> Right, but that's separate from DAX being scream in your face
> "EXPERIMENTAL!". It's just an additional feature that can be added on
> once all the normal expectations of a userspace mapping work. I think
> reliable rmap is the last of those requirements.
> 
> > The recent thread between Jan and Dan make me wonder if making mappings
> > share struct pages is going to be a nightmare to add to the mm code,
> > though...
> 
> It's going to be a bit messy because a singular page->mapping
> association is fundamentally incompatible with DAX. Perhaps a linked
> list of mapping "siblings"?

I'd much prefer the filesystem allocate/control the struct page that
is inserted into mapping trees so we can have multiple struct pages
pointing at the one physical page.  That way we can just insert
these dynamic struct pages into the relevant mappings and it works
the same way for both DAX and shared page cache pages.

i.e. the filesystem knows they are shared physical blocks, the
filesystem controls COW of physical blocks, the filesystem controls
truncate/invalidation of physical blocks, the filesystem controls
cache state of the physical blocks. So why are we designing
infrastructure around the virtual memory and caching infrastructure
that bypasses the layer that manages and arbitrates access to the
physical storage?
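
A toy userspace model of the arrangement described above, only to make
the data relationships concrete. Everything here is hypothetical:
PageDescriptor stands in for a filesystem-allocated struct page,
ToyFilesystem for the layer that owns the physical blocks, and none of
the names correspond to real kernel interfaces. It assumes the
filesystem keeps one descriptor per (mapping, index) and that several
descriptors may carry the same pfn.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare by identity, like distinct struct page instances
class PageDescriptor:
    """Hypothetical per-mapping descriptor pointing at one physical frame."""
    pfn: int
    mapping: str
    index: int


class ToyFilesystem:
    """Hypothetical owner of the physical blocks and of all descriptors."""

    def __init__(self, first_free_pfn=1000):
        self.mappings = {}   # mapping name -> {index: PageDescriptor}
        self.by_pfn = {}     # pfn -> list of descriptors sharing that frame
        self._next_free_pfn = first_free_pfn

    def _alloc_pfn(self):
        pfn = self._next_free_pfn
        self._next_free_pfn += 1
        return pfn

    def map_block(self, mapping, index, pfn):
        """Insert a new descriptor for `pfn` into one mapping."""
        desc = PageDescriptor(pfn, mapping, index)
        self.mappings.setdefault(mapping, {})[index] = desc
        self.by_pfn.setdefault(pfn, []).append(desc)
        return desc

    def is_shared(self, pfn):
        return len(self.by_pfn.get(pfn, [])) > 1

    def write_fault(self, mapping, index):
        """Break sharing on write: the filesystem decides when to COW.

        Only the bookkeeping is modelled; copying the data is omitted.
        """
        desc = self.mappings[mapping][index]
        if self.is_shared(desc.pfn):
            self.by_pfn[desc.pfn].remove(desc)
            desc.pfn = self._alloc_pfn()
            self.by_pfn[desc.pfn] = [desc]
        return desc.pfn
```

With two files mapping pfn 42, a write fault on one of them moves only
that file's descriptor to a fresh pfn and leaves the other untouched.
The point of the sketch is that sharing state lives with the
filesystem, not in a single page->mapping back-pointer.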

This seems like we're well down the path of an architectural layering
violation that is backing us into a corner we're not going to be
able to get ourselves out of...

> > Also: ideally XFS would also be able to consume poison event
> > notifications from the pmem so that it can try to deal with metadata
> > loss, but that's probably a separate effort.

If the design is such that the layer that manages the physical
storage isn't going to be told about physical storage failures
before anyone else is informed, it would seem to me like we really
have introduced a major architectural flaw in DAX....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
  2018-05-31 23:05           ` Dave Chinner
  (?)
@ 2018-06-01  1:03             ` Dan Williams
  -1 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2018-06-01  1:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Darrick J. Wong, NVDIMM-ML, linux-xfs, Yasunori Goto,
	linux-ext4

On Thu, May 31, 2018 at 4:05 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, May 31, 2018 at 11:26:43AM -0700, Dan Williams wrote:
>> On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Thu, May 31, 2018 at 09:29:15AM -0700, Dan Williams wrote:
>> >> On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
>> >> <ross.zwisler@linux.intel.com> wrote:
>> >> > On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
>> >> >> Hello,
>> >> >>
>> >> >>
>> >> >> I would like to know about the Experimental message of Filesystem DAX.
>> >> >> --------------------------------------------------------
>> >> >> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
>> >> >> --------------------------------------------------------
>> >> >>
>> >> >> AFAIK, the final issue of Filesystem DAX is metadata update problem,
>> >> >> and it is(will be?) solved by great effort of MAP_SYNC and
>> >> >> "fix dma vs truncate/hole-punch" patch set.
>> >> >> So, I suppose that the Experimental message can be removed,
>> >> >> but I'm not sure.
>> >> >>
>> >> >> Is it possible?
>> >> >> Otherwise, are there any other issues in Filesystem DAX yet?
>> >> >>
>> >> >> If this is silly question, sorry for noise....
>> >> >>
>> >> >> Thanks,
>> >> >> ---
>> >> >> Yasunori Goto
>> >> >
>> >> > Adding in the XFS and ext4 developers, as it's really their call when to
>> >> > remove this notice.
>> >> >
>> >> > We've talked about this off and on for a long while, but IMHO we should remove
>> >> > the EXPERIMENTAL warning.  The last few things that we had on our TODO list
>> >> > before this was removed were:
>> >> >
>> >> > 1) Get consistent handling of the DAX mount option.  We currently have this,
>> >> > as both filesystems will behave the same and fall back and remove the DAX
>> >> > mount option if it is unsupported by the block device, etc.
>> >
>> > <nod>
>> >
>> > As an aside, I wonder if Christoph's musings about "just have the kernel
>> > determine the appropriate dax/non-dax setting from the acpi tables and
>> > skip the inode flag entirely" ever got resolved?
>> >
>> >> > 2) Get consistent handling of the DAX inode option.  We currently have this,
>> >> > as all DAX behavior now happens through the mount option.  If/when we
>> >> > re-enable the per-inode DAX flag we should do it consistently for all DAX
>> >> > enabled filesystems.
>> >
>> > The behavior of the inode flag isn't all that consistent.  ext4 doesn't
>> > support it at all.  On XFS, you can set or clear FS_XFLAG_DAX on a
>> > directory which will propagate the setting to any files created in that
>> > directory.
>> >
>> > However, if you set or clear it on a file we update the on-disk inode
>> > but we can't change the in-core state flag (S_DAX) until the next
>> > in-core inode instantiation.  It's weird that users can change the flag
>> > but the intended behavior changes won't happen until some ... time ...
>> > in the future??
>> >
>> >> > 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
>> >> > done, but we at least disallow DAX with XFS features like reflink where it
>> >> > could be an issue.  Darrick, do you still feel like we need to get these
>> >> > working together to remove EXPERIMENTAL, or are you happy enough that we're
>> >> > keeping them separated and that we're keeping user data safe?
>> >
>> > Yes, reflink and dax still need to work together.  I've not heard any
>> > good arguments for why page sharing + copy on write are fundamentally
>> > incompatible with the dax model, or why dax users will never, ever
>> > require reflink.
>>
>> Right, but that's separate from DAX being scream-in-your-face
>> "EXPERIMENTAL!". It's just an additional feature that can be added on
>> once all the normal expectations of a userspace mapping work. I think
>> reliable rmap is the last of those requirements.
>>
>> > The recent thread between Jan and Dan makes me wonder if making mappings
>> > share struct pages is going to be a nightmare to add to the mm code,
>> > though...
>>
>> It's going to be a bit messy because a singular page->mapping
>> association is fundamentally incompatible with DAX. Perhaps a linked
>> list of mapping "siblings"?
>
> I'd much prefer the filesystem allocate/control the struct page that
> is inserted into mapping trees so we can have multiple struct pages
> pointing at the one physical page.  That way we can just insert
> these dynamic struct pages into the relevant mappings and it works
> the same way for both DAX and shared page cache pages.

How would that work when there is a 1:1 pfn-to-page and
file-block-to-pfn relationship?

> i.e. the filesystem knows they are shared physical blocks, the
> filesystem controls COW of physical blocks, the filesystem controls
> truncate/invalidation of physical blocks, the filesystem controls
> cache state of the physical blocks. So why are we designing
> infrastructure around the virtual memory and caching infrastructure
> that bypasses the layer that manages and arbitrates access to the
> physical storage?

Yes, because DAX broke the vm's assumptions that pages are not
physical storage blocks.

> This seems like we're well down the path of an architectural layering
> violation that is backing us into a corner we're not going to be
> able to get ourselves out of...

I think it is solvable by teaching the vm more about dax pages and
having it call back into the filesystem for some of these operations.
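
A rough sketch of what "call back into the filesystem" could look like,
as a toy userspace model. DaxPageOps, ToyVM and the can_free callback
are hypothetical names chosen for illustration and are not kernel APIs.
The assumption is that the VM keeps a per-page callback table for DAX
pages and defers the release decision to the filesystem, which knows
whether the backing block is still busy (for example under DMA).

```python
from typing import Callable, Dict


class DaxPageOps:
    """Hypothetical callback table a filesystem hands to the VM."""

    def __init__(self, can_free: Callable[[int], bool]):
        self.can_free = can_free


class ToyVM:
    """Hypothetical VM layer that knows which pfns are DAX pages."""

    def __init__(self):
        self._dax_ops: Dict[int, DaxPageOps] = {}

    def mark_dax(self, pfn: int, ops: DaxPageOps) -> None:
        self._dax_ops[pfn] = ops

    def release_page(self, pfn: int) -> str:
        ops = self._dax_ops.get(pfn)
        if ops is None:
            # Ordinary page cache page: the VM decides on its own.
            return "freed-by-vm"
        # DAX page: the page *is* the storage block, so ask the filesystem.
        if ops.can_free(pfn):
            return "freed-by-fs"
        return "deferred"
```

In this model a truncate of a DAX page that the filesystem still
considers busy returns "deferred" instead of being released behind the
filesystem's back.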

>> > Also: ideally XFS would also be able to consume poison event
>> > notifications from the pmem so that it can try to deal with metadata
>> > loss, but that's probably a separate effort.
>
> If the design is such that the layer that manages the physical
> storage isn't going to be told about physical storage failures
> before anyone else is informed, it would seem to me like we really
> have introduced a major architectural flaw in DAX....

It would be trivial to hook these notifications into the filesystem.
This is something we've had on the backlog for a long time to circle
back and address. This has been waiting for the xfs reverse-map work
to settle and for one of us pmem developers to free up and do the
work.
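
To illustrate why the reverse map matters for this, here is a toy
userspace model of the notification path. ToyRmap, ToyPmemDevice and
register_poison_handler are hypothetical names and not real kernel or
XFS interfaces. The assumption is that the pmem driver reports a bad
physical range and the filesystem uses a reverse map to find which
owners (metadata or file data) overlap it.

```python
class ToyRmap:
    """Hypothetical reverse map: physical extent -> owner label."""

    def __init__(self):
        self._extents = []  # list of (start, length, owner)

    def add(self, start, length, owner):
        self._extents.append((start, length, owner))

    def owners(self, start, length):
        """Return owners of every extent overlapping [start, start+length)."""
        end = start + length
        return [owner for (s, l, owner) in self._extents
                if s < end and start < s + l]


class ToyPmemDevice:
    """Hypothetical device that forwards poison events to registered handlers."""

    def __init__(self):
        self._handlers = []

    def register_poison_handler(self, handler):
        self._handlers.append(handler)

    def report_poison(self, start, length):
        # The filesystem handlers are told first; each returns what it found.
        return [handler(start, length) for handler in self._handlers]
```

A filesystem handler registered this way can tell from the returned
owners whether the poisoned range hit metadata, file data, or free
space, and react accordingly.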

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
@ 2018-06-04  1:44             ` Yasunori Goto
  0 siblings, 0 replies; 31+ messages in thread
From: Yasunori Goto @ 2018-06-04  1:44 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Jan Kara, NVDIMM-ML, Darrick J. Wong, linux-xfs, linux-ext4

> On Thu, May 31, 2018 at 11:26:43AM -0700, Dan Williams wrote:
> > On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
> > <darrick.wong@oracle.com> wrote:
> > > On Thu, May 31, 2018 at 09:29:15AM -0700, Dan Williams wrote:
> > >> On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
> > >> <ross.zwisler@linux.intel.com> wrote:
> > >> > On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
> > >> >> Hello,
> > >> >>
> > >> >>
> > >> >> I would like to know about the Experimental message of Filesystem DAX.
> > >> >> --------------------------------------------------------
> > >> >> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> > >> >> --------------------------------------------------------
> > >> >>
> > >> >> AFAIK, the final issue of Filesystem DAX is metadata update problem,
> > >> >> and it is(will be?) solved by great effort of MAP_SYNC and
> > >> >> "fix dma vs truncate/hole-punch" patch set.
> > >> >> So, I suppose that the Experimental message can be removed,
> > >> >> but I'm not sure.
> > >> >>
> > >> >> Is it possible?
> > >> >> Otherwise, are there any other issues in Filesystem DAX yet?
> > >> >>
> > >> >> If this is silly question, sorry for noise....
> > >> >>
> > >> >> Thanks,
> > >> >> ---
> > >> >> Yasunori Goto
> > >> >
> > >> > Adding in the XFS and ext4 developers, as it's really their call when to
> > >> > remove this notice.
> > >> >
> > >> > We've talked about this off and on for a long while, but IMHO we should remove
> > >> > the EXPERIMENTAL warning.  The last few things that we had on our TODO list
> > >> > before this was removed were:
> > >> >
> > >> > 1) Get consistent handling of the DAX mount option.  We currently have this,
> > >> > as both filesystems will behave the same and fall back and remove the DAX
> > >> > mount option if it is unsupported by the block device, etc.
> > >
> > > <nod>
> > >
> > > As an aside, I wonder if Christoph's musings about "just have the kernel
> > > determine the appropriate dax/non-dax setting from the acpi tables and
> > > skip the inode flag entirely" ever got resolved?
> > >
> > >> > 2) Get consistent handling of the DAX inode option.  We currently have this,
> > >> > as all DAX behavior now happens through the mount option.  If/when we
> > >> > re-enable the per-inode DAX flag we should do it consistently for all DAX
> > >> > enabled filesystems.
> > >
> > > The behavior of the inode flag isn't all that consistent.  ext4 doesn't
> > > support it at all.  On XFS, you can set or clear FS_XFLAG_DAX on a
> > > directory which will propagate the setting to any files created in that
> > > directory.
> > >
> > > However, if you set or clear it on a file we update the on-disk inode
> > > but we can't change the in-core state flag (S_DAX) until the next
> > > in-core inode instantiation.  It's weird that users can change the flag
> > > but the intended behavior changes won't happen until some ... time ...
> > > in the future??
> > >
> > >> > 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
> > >> > done, but we at least disallow DAX with XFS features like reflink where it
> > >> > could be an issue.  Darrick, do you still feel like we need to get these
> > >> > working together to remove EXPERIMENTAL, or are you happy enough that we're
> > >> > keeping them separated and that we're keeping user data safe?
> > >
> > > Yes, reflink and dax still need to work together.  I've not heard any
> > > good arguments for why page sharing + copy on write are fundamentally
> > > incompatible with the dax model, or why dax users will never, ever
> > > require reflink.
> > 
> > Right, but that's separate from DAX being scream in your face
> > "EXPERIMENTAL!". It's just an additional feature that can be added on
> > once all the normal expectations of a userspace mapping work. I think
> > reliable rmap is the last of those requirements.
> > 
> > > The recent thread between Jan and Dan make me wonder if making mappings
> > > share struct pages is going to be a nightmare to add to the mm code,
> > > though...
> > 
> > It's going to be a bit messy because a singular page->mapping
> > association is fundamentally incompatible with DAX. Perhaps a linked
> > list of mapping "siblings"?
> > 
> > > Also: ideally XFS would also be able to consume poison event
> > > notifications from the pmem so that it can try to deal with metadata
> > > loss, but that's probably a separate effort.
> > 
> > Right, not a gating item for declaring DAX ready for prime time.
> 
> Yep, I think that the very loud EXPERIMENTAL message is essentially telling
> users "your data is at risk if you use this".  I totally agree that we still
> have lots of work to do.  However, I don't think that these feature
> enhancements should gate removal of the EXPERIMENTAL notice.   IMHO that
> should only exist as long as we have issues that we know could corrupt data,
> crash the box, etc.  As far as I know those are basically the 2 items on Dan's
> list from a few mails ago (poison recovery & DMA vs truncate).

Everyone,

Thank you very much for your information/opinions.
Beyond the "experimental" question, I now understand what work remains.

Thanks a lot!
---
Yasunori Goto




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
@ 2018-06-04  3:51             ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2018-06-04  3:51 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Darrick J. Wong, NVDIMM-ML, linux-xfs, Yasunori Goto,
	linux-ext4

On Thu, May 31, 2018 at 02:25:56PM -0600, Ross Zwisler wrote:
> > Right, not a gating item for declaring DAX ready for prime time.
> 
> Yep, I think that the very loud EXPERIMENTAL message is essentially telling
> users "your data is at risk if you use this".

And that's a call the filesystem maintainers need to make, not the
DAX developers. The recent "oh fuck, DAX on XFS has stopped working
in 4.17" report, which came a long time after the changes that broke
it went into the mainline kernel, indicates we have a serious
testing problem here.

i.e. that the filesystem developers who will have to maintain this
stuff and deal with all the "DAX ate my data" bug reports haven't
been testing DAX on their filesystems at all in recent times. And
that kinda says to me that the EXPERIMENTAL flag is not getting
removed any time soon....

> As far as I know those are basically the 2 items on Dan's
> list from a few mails ago (poison recovery & DMA vs truncate).

FWIW, I have not been testing DAX because I'm waiting for
infrastructure problems like DMA vs truncate to get fixed first....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
  2018-05-31 23:05           ` Dave Chinner
  (?)
@ 2018-06-07 14:38             ` Jan Kara
  -1 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2018-06-07 14:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Darrick J. Wong, NVDIMM-ML, linux-xfs, Yasunori Goto,
	linux-ext4

On Fri 01-06-18 09:05:50, Dave Chinner wrote:
> On Thu, May 31, 2018 at 11:26:43AM -0700, Dan Williams wrote:
> > On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
> > > The recent thread between Jan and Dan make me wonder if making mappings
> > > share struct pages is going to be a nightmare to add to the mm code,
> > > though...
> > 
> > It's going to be a bit messy because a singular page->mapping
> > association is fundamentally incompatible with DAX. Perhaps a linked
> > list of mapping "siblings"?
> 
> I'd much prefer the filesystem allocate/control the struct page that
> is inserted into mapping trees so we can have multiple struct pages
> pointing at the one physical page.  That way we can just insert
> these dynamic struct pages into the relevant mappings and it works
> the same way for both DAX and shared page cache pages.

Yes, that's one option but the overhead in terms of memory and CPU is
non-trivial and as Dan writes there are assumptions in MM code that
PFN<->struct page is a 1:1 relationship (or possibly 1:0 as struct page can
be missing for certain types of pfns). But at this point I think the exact
data structure layout is not that important (whether there will be dynamic
struct pages or some other linkage). I think we first need to settle on
how responsibilities are going to be split between MM and filesystems.

> i.e. the filesystem knows they are shared physical blocks, the
> filesystem controls COW of physical blocks, the filesystem controls
> truncate/invalidation of physical blocks, the filesystem controls
> cache state of the physical blocks. So why are we designing
> infrastructure around the virtual memory and caching infrastructure
> that bypasses the layer that manages and arbitrates access to the
> physical storage?

I agree with this but let's see whether we are on the same page.  What MM
mostly cares about and in particular what Dan needs to solve is "given
struct page, give me all page tables that map this page". And it seems
completely fair to me to maintain such translation within MM code as the
filesystem generally does not care in much detail about memory mappings
of files.  When something is going to happen with the page (like breaking
COW, truncate, whatever), the filesystem just tells MM to invalidate all
page tables for the page and subsequent page faults will fill in updated
information - nothing new here.
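
As a rough illustration of this first question (not code from this thread,
and not kernel code), a reverse map can be modelled in user space as a table
from a page frame number to the set of mappings that reference it. The
class name, method names and identifiers below are all hypothetical; the
sketch only shows the "invalidate everything, let later faults refill"
flow described above.

```python
# Toy user-space model of a reverse map (rmap): page -> mappings of it.
# All names here are hypothetical; this is not how the kernel implements rmap.
from collections import defaultdict


class ToyRmap:
    def __init__(self):
        # pfn -> set of (address_space_id, virtual_address) pairs
        self._mappers = defaultdict(set)

    def map_page(self, pfn, mm_id, vaddr):
        """Record that address space mm_id maps pfn at vaddr (a page fault)."""
        self._mappers[pfn].add((mm_id, vaddr))

    def mappers(self, pfn):
        """Answer 'given a page, give me everything that maps it'."""
        return sorted(self._mappers.get(pfn, ()))

    def invalidate(self, pfn):
        """Drop every mapping of pfn, as a filesystem would request before
        truncate or breaking COW; later faults would re-create them."""
        return sorted(self._mappers.pop(pfn, ()))


rmap = ToyRmap()
rmap.map_page(pfn=42, mm_id="proc-A", vaddr=0x7F0000000000)
rmap.map_page(pfn=42, mm_id="proc-B", vaddr=0x7F1000000000)
dropped = rmap.invalidate(42)   # both mappings are torn down together
```

In this model the filesystem never needs to know who maps the page; it only
asks for the invalidation, which matches the split of responsibilities
suggested above.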

Then there's a second slightly different question - and I suspect you are
speaking about that one - "given struct page, give me all radix trees
pointing to this page". This is more a filesystem caching question (even in
case of DAX, you can think of this as caching of inode + logical offset =>
physical block translations). Currently this translation is maintained by
page cache and again the filesystem is mostly ignorant of the details of
what gets cached and when, and just tells the MM when it should throw away
the cached information (through truncate/invalidate_inode_pages()).

With reflink and DAX / shared page cache these questions get more complex
to answer but in principle I don't see why the mapping from the struct page
to radix trees should not be maintained (with the help of the filesystem) in
generic code. Sure, the interface now needs to be more flexible and in that
sense filesystems will be more in control of what the page cache is doing -
e.g. the ->readpage callback would probably need to be passed just inode +
logical offset and *the filesystem* will now need to find / allocate an
appropriate page, fill it up, and tell the page cache that this page is now
caching the given offset in the file. And the page cache will add this inode
+ offset to a list of things caching this page. Do you agree, or did you
have something different in mind?
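
The proposal in the paragraph above can be sketched as a toy user-space
model, under heavy assumptions: the filesystem resolves (inode, offset) to
a physical block and hands back the page for it, and the page cache records
every (inode, offset) that caches that page. None of these classes or
methods exist in the kernel; they are hypothetical names used only to make
the division of labour concrete.

```python
# Toy model: one physical page cached on behalf of several (inode, offset)
# pairs, as with reflinked files. Names are hypothetical, not kernel APIs.


class ToyPage:
    def __init__(self, block):
        self.block = block    # physical block backing this page
        self.owners = []      # (inode, offset) pairs caching this page


class ToyFilesystem:
    """Knows which physical block backs each (inode, offset)."""

    def __init__(self, extent_map):
        self.extent_map = extent_map   # (inode, offset) -> physical block
        self.pages = {}                # physical block -> ToyPage

    def read_page(self, inode, offset):
        # The filesystem, not the page cache, finds or allocates the page.
        block = self.extent_map[(inode, offset)]
        if block not in self.pages:
            self.pages[block] = ToyPage(block)
        return self.pages[block]


class ToyPageCache:
    def __init__(self, fs):
        self.fs = fs
        self.index = {}                # (inode, offset) -> ToyPage

    def lookup(self, inode, offset):
        key = (inode, offset)
        if key not in self.index:
            page = self.fs.read_page(inode, offset)
            page.owners.append(key)    # remember who caches this page
            self.index[key] = page
        return self.index[key]


# Inodes 1 and 2 share physical block 100 (a reflinked extent).
fs = ToyFilesystem({(1, 0): 100, (2, 0): 100})
cache = ToyPageCache(fs)
page_a = cache.lookup(1, 0)
page_b = cache.lookup(2, 0)   # same page object, second owner recorded
```

The owners list plays the role of the "list of things caching this page":
walking it answers the second question for a shared page without requiring
a single page->mapping back-pointer.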

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Question about Experimental of Filesystem DAX.
@ 2018-06-07 14:38             ` Jan Kara
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2018-06-07 14:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Darrick J. Wong, Ross Zwisler, Yasunori Goto,
	Jan Kara, linux-xfs, linux-ext4, NVDIMM-ML

On Fri 01-06-18 09:05:50, Dave Chinner wrote:
> On Thu, May 31, 2018 at 11:26:43AM -0700, Dan Williams wrote:
> > On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
> > > The recent thread between Jan and Dan make me wonder if making mappings
> > > share struct pages is going to be a nightmare to add to the mm code,
> > > though...
> > 
> > It's going to be a bit messy because a singular page->mapping
> > association is fundamentally incompatible with DAX. Perhaps a linked
> > list of mapping "siblings"?
> 
> I'd much prefer the filesystem allocate/control the struct page that
> is inserted into mapping trees so we can have multiple struct pages
> pointing at the one physical page.  That way we can just insert
> these dynamic struct pages into the relevant mappings and it works
> the same way for both DAX and shared page cache pages.

Yes, that's one option but the overhead in terms of memory and CPU is
non-trivial and, as Dan writes, there are assumptions in MM code that
PFN<->struct page is a 1:1 relationship (or possibly 1:0, as struct page can
be missing for certain types of pfns). But at this point I think the exact
data structure layout is not that important (whether there will be dynamic
struct pages or some other linkage). I think we first need to settle on
how responsibilities between MM and filesystems are going to be split.

> i.e. the filesystem knows they are shared physical blocks, the
> filesystem controls COW of physical blocks, the filesystem controls
> truncate/invalidation of physical blocks, the filesystem controls
> cache state of the physical blocks. So why are we designing
> infrastructure around the virtual memory and caching infrastructure
> that bypasses the layer that manages and arbitrates access to the
> physical storage?

I agree with this but let's see whether we are on the same page.  What MM
mostly cares about, and in particular what Dan needs to solve, is "given
struct page, give me all page tables that map this page". And it seems
completely fair to me to maintain such translation within MM code as the
filesystem generally does not care in much detail about memory mappings
of files.  When something is going to happen with the page (like breaking
COW, truncate, whatever), the filesystem just tells MM to invalidate all
page tables for the page and subsequent page faults will fill in updated
information - nothing new here.
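
As a rough illustration of that division of labour, here is a toy userspace
model (not kernel code). ToyMM, fault(), unmap_page() and the lookup_pfn
callback are hypothetical names made up for this sketch; the only thing
taken from the discussion above is the idea that MM keeps the page -> page
tables translation and the filesystem merely asks for it to be invalidated.

```python
from collections import defaultdict


class ToyMM:
    """Toy model of the MM side: page tables plus a per-page reverse map."""

    def __init__(self):
        self.page_tables = defaultdict(dict)  # pid -> {vaddr: pfn}
        self.rmap = defaultdict(set)          # pfn -> {(pid, vaddr)}

    def fault(self, pid, vaddr, lookup_pfn):
        """Page fault: ask the filesystem (lookup_pfn) which pfn backs the
        faulting offset right now, then install and record the mapping."""
        old = self.page_tables[pid].get(vaddr)
        if old is not None:
            # Drop the stale reverse-map entry before remapping.
            self.rmap[old].discard((pid, vaddr))
        pfn = lookup_pfn()
        self.page_tables[pid][vaddr] = pfn
        self.rmap[pfn].add((pid, vaddr))
        return pfn

    def unmap_page(self, pfn):
        """Called by the filesystem before a COW break or truncate: zap
        every page table entry that maps pfn."""
        for pid, vaddr in self.rmap.pop(pfn, set()):
            del self.page_tables[pid][vaddr]
```

In this model the filesystem never walks page tables itself. It changes its
own block mapping, calls unmap_page() on the old pfn, and the next fault()
picks up whatever the filesystem reports at that point.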

Then there's a second, slightly different question - and I suspect you are
speaking about that one - "given struct page, give me all radix trees
pointing to this page". This is more a filesystem caching question (even in
case of DAX, you can think of this as caching of inode + logical offset =>
physical block translations). Currently this translation is maintained by
the page cache and again the filesystem is mostly ignorant of the details of
what gets cached and when, and just tells the MM when it should throw away
the cached information (through truncate/invalidate_inode_pages()).

With reflink and DAX / shared page cache these questions get more complex
to answer but in principle I don't see why the mapping from the struct page
to radix trees should not be maintained (with the help of the filesystem) in
generic code. Sure, the interface now needs to be more flexible and in that
sense filesystems will be more in control of what the page cache is doing -
e.g. the ->readpage callback would probably need to be passed just inode +
logical offset and *the filesystem* will now need to find / allocate an
appropriate page, fill it up, and tell the page cache this page is now
caching the given offset in the file. And the page cache will add this
inode + offset to a list of things caching this page. Do you agree, or did
you have something different in mind?
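
To make the proposed interface a bit more concrete, here is a toy userspace
sketch (not kernel code, and not an existing API). Page, ToyReflinkFS,
ToyPageCache and the "cachers" set are hypothetical names for this model;
plain dicts stand in for the per-inode radix trees. It assumes the
filesystem keeps at most one page per physical block, which is one possible
choice and not something settled above.

```python
class Page:
    """Toy stand-in for struct page: one physical block, many cachers."""

    def __init__(self, pblk, data):
        self.pblk = pblk
        self.data = data
        self.cachers = set()  # {(ino, index)} currently caching this page


class ToyReflinkFS:
    """Filesystem side: knows which blocks are shared, owns the pages."""

    def __init__(self, extents, blocks):
        self.extents = extents  # (ino, index) -> physical block
        self.blocks = blocks    # physical block -> contents
        self.pages = {}         # physical block -> Page

    def readpage(self, ino, index):
        """Given only inode + logical offset, find or allocate the page."""
        pblk = self.extents[(ino, index)]
        page = self.pages.get(pblk)
        if page is None:
            page = Page(pblk, self.blocks[pblk])  # allocate and fill
            self.pages[pblk] = page
        return page


class ToyPageCache:
    """Generic side: per-inode trees plus the page -> cachers linkage."""

    def __init__(self, fs):
        self.fs = fs
        self.trees = {}  # ino -> {index: Page}

    def read(self, ino, index):
        tree = self.trees.setdefault(ino, {})
        page = tree.get(index)
        if page is None:
            page = self.fs.readpage(ino, index)
            tree[index] = page
            page.cachers.add((ino, index))  # remember who caches the page
        return page

    def invalidate(self, ino, index):
        page = self.trees.get(ino, {}).pop(index, None)
        if page is not None:
            page.cachers.discard((ino, index))
        return page
```

With two reflinked inodes whose offset 0 maps to the same block, both read()
calls return the same Page object and its cachers set lists both
(inode, offset) pairs, which is the "list of things caching this page"
described above.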

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-06-07 14:38 UTC | newest]

Thread overview: 31+ messages
2018-05-31  2:27 Question about Experimental of Filesystem DAX Yasunori Goto
2018-05-31 15:07 ` Ross Zwisler
2018-05-31 16:29   ` Dan Williams
2018-05-31 17:46     ` Darrick J. Wong
2018-05-31 18:26       ` Dan Williams
2018-05-31 20:25         ` Ross Zwisler
2018-06-04  1:44           ` Yasunori Goto
2018-06-04  3:51           ` Dave Chinner
2018-05-31 23:05         ` Dave Chinner
2018-06-01  1:03           ` Dan Williams
2018-06-07 14:38           ` Jan Kara
