nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* can we finally kill off CONFIG_FS_DAX_LIMITED
@ 2021-08-20  5:43 Christoph Hellwig
  2021-08-20 15:41 ` Dan Williams
  2021-08-23 14:05 ` Gerald Schaefer
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2021-08-20  5:43 UTC (permalink / raw)
  To: Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Dan Williams, nvdimm, linux-s390

Hi all,

looking at the recent ZONE_DEVICE related changes we still have a
horrible maze of different code paths.  I already suggested to
depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
architectures have anyway.  But the other odd special case is
CONFIG_FS_DAX_LIMITED which is just used for the xpram driver.  Does
this driver still see use?  If so can we make it behave like the
other DAX drivers and require a pgmap?  I think the biggest missing
part would be to implement ARCH_HAS_PTE_DEVMAP for s390.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-20  5:43 can we finally kill off CONFIG_FS_DAX_LIMITED Christoph Hellwig
@ 2021-08-20 15:41 ` Dan Williams
  2021-08-20 17:42   ` Dan Williams
  2021-08-23 14:05 ` Gerald Schaefer
  1 sibling, 1 reply; 22+ messages in thread
From: Dan Williams @ 2021-08-20 15:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Linux NVDIMM, linux-s390, Gerald Schaefer, Joao Martins

[ add Gerald and Joao ]

On Thu, Aug 19, 2021 at 10:44 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Hi all,
>
> looking at the recent ZONE_DEVICE related changes we still have a
> horrible maze of different code paths.  I already suggested to
> depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
> architectures have anyway.  But the other odd special case is
> CONFIG_FS_DAX_LIMITED which is just used for the xpram driver.  Does
> this driver still see use?  If so can we make it behave like the
> other DAX drivers and require a pgmap?  I think the biggest missing
> part would be to implement ARCH_HAS_PTE_DEVMAP for s390.
>

Gerald,

Might you still be looking to help dcssblk get out of FS_DAX_LIMITED
jail [1]? I recall Martin saying that 'struct page' overhead was
prohibitive. I don't know if Joao's 'struct page' diet patches could
help alleviate that at all (would require the filesystem to only
allocate blocks in large page sizes).

[1]: https://lore.kernel.org/r/20180523205017.0f2bc83e@thinkpad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-20 15:41 ` Dan Williams
@ 2021-08-20 17:42   ` Dan Williams
  2021-08-20 19:03     ` Gerald Schaefer
  2021-08-24 14:17     ` Joao Martins
  0 siblings, 2 replies; 22+ messages in thread
From: Dan Williams @ 2021-08-20 17:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Linux NVDIMM, linux-s390, Joao Martins, Gerald Schaefer

[ fix Gerald's email ]

On Fri, Aug 20, 2021 at 8:41 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> [ add Gerald and Joao ]
>
> On Thu, Aug 19, 2021 at 10:44 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > Hi all,
> >
> > looking at the recent ZONE_DEVICE related changes we still have a
> > horrible maze of different code paths.  I already suggested to
> > depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
> > architectures have anyway.  But the other odd special case is
> > CONFIG_FS_DAX_LIMITED which is just used for the xpram driver.  Does
> > this driver still see use?  If so can we make it behave like the
> > other DAX drivers and require a pgmap?  I think the biggest missing
> > part would be to implement ARCH_HAS_PTE_DEVMAP for s390.
> >
>
> Gerald,
>
> Might you still be looking to help dcssblk get out of FS_DAX_LIMITED
> jail [1]? I recall Martin saying that 'struct page' overhead was
> prohibitive. I don't know if Joao's 'struct page' diet patches could
> help alleviate that at all (would require the filesystem to only
> allocate blocks in large page sizes).
>
> [1]: https://lore.kernel.org/r/20180523205017.0f2bc83e@thinkpad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-20 17:42   ` Dan Williams
@ 2021-08-20 19:03     ` Gerald Schaefer
  2021-08-24 14:17     ` Joao Martins
  1 sibling, 0 replies; 22+ messages in thread
From: Gerald Schaefer @ 2021-08-20 19:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Linux NVDIMM, linux-s390, Joao Martins

On Fri, 20 Aug 2021 10:42:14 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> [ fix Gerald's email ]
> 
> On Fri, Aug 20, 2021 at 8:41 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > [ add Gerald and Joao ]
> >
> > On Thu, Aug 19, 2021 at 10:44 PM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > Hi all,
> > >
> > > looking at the recent ZONE_DEVICE related changes we still have a
> > > horrible maze of different code paths.  I already suggested to
> > > depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
> > > architectures have anyway.  But the other odd special case is
> > > CONFIG_FS_DAX_LIMITED which is just used for the xpram driver.  Does
> > > this driver still see use?  If so can we make it behave like the
> > > other DAX drivers and require a pgmap?  I think the biggest missing
> > > part would be to implement ARCH_HAS_PTE_DEVMAP for s390.
> > >
> >
> > Gerald,
> >
> > Might you still be looking to help dcssblk get out of FS_DAX_LIMITED
> > jail [1]? I recall Martin saying that 'struct page' overhead was
> > prohibitive. I don't know if Joao's 'struct page' diet patches could
> > help alleviate that at all (would require the filesystem to only
> > allocate blocks in large page sizes).
> >
> > [1]: https://lore.kernel.org/r/20180523205017.0f2bc83e@thinkpad

Ouch, yes, that is actually (still) on my list. Also, adding struct
pages for dcssblk in XIP / DAX mode is not really prohibitive. The
whole feature was designed to allow saving memory when the same
binaries need to be executed from many z/VM guests. The XIP
(execute-in-place) part allows to execute them directly from the
DCSS (shared) memory segment, saving page cache usage on all guests.

Additionally saving the struct pages for the DCSS memory itself can
be considered a bonus, but it's just 1/64th of the otherwise saved
memory. For this reason, and also because there are hardly any users
of that feature left, we certainly can do w/o that, i.e. live with
having the struct pages.

I must admit that I neglected this a bit, and it was only a question
of time until this popped up again (rightfully). I understand that
having and maintaining this FS_DAX_LIMITED workaround is really ugly.

So I will give this some priority now, and I hope that it still only
affects dcssblk, and we did not get any other non-s390 users in the
meantime.

BTW, s390 xpram driver should not be affected. AFAIR, we also might
not have struct pages there, but it also is not possible to use it
with DAX, so it should not care about FS_DAX_LIMITED.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-20  5:43 can we finally kill off CONFIG_FS_DAX_LIMITED Christoph Hellwig
  2021-08-20 15:41 ` Dan Williams
@ 2021-08-23 14:05 ` Gerald Schaefer
  2021-08-23 19:47   ` Gerald Schaefer
  2021-08-24  6:49   ` David Hildenbrand
  1 sibling, 2 replies; 22+ messages in thread
From: Gerald Schaefer @ 2021-08-23 14:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Dan Williams, nvdimm, linux-s390

On Fri, 20 Aug 2021 07:43:40 +0200
Christoph Hellwig <hch@lst.de> wrote:

> Hi all,
> 
> looking at the recent ZONE_DEVICE related changes we still have a
> horrible maze of different code paths.  I already suggested to
> depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
> architectures have anyway.  But the other odd special case is
> CONFIG_FS_DAX_LIMITED which is just used for the xpram driver.  Does
> this driver still see use?  If so can we make it behave like the
> other DAX drivers and require a pgmap?  I think the biggest missing
> part would be to implement ARCH_HAS_PTE_DEVMAP for s390.

Puh, yes, that seems to be needed in order to enable ZONE_DEVICE, and
then we could use devm_memremap_pages(), at least that was my plan
some time ago. However, either the ARCH_HAS_PTE_DEVMAP dependency
is new, or I overlooked it before, but we do not have any free bits
in the pte left, so this is not going to work.

Would it strictly be necessary to implement ZONE_DEVICE, or would
it be enough if we would use e.g. add_memory() instead of just
adding the DCSS memory directly to the kernel mapping via
vmem_add_mapping()? That way we might at least get the struct pages,
but somehow it doesn't feel completely right.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-23 14:05 ` Gerald Schaefer
@ 2021-08-23 19:47   ` Gerald Schaefer
  2021-08-23 20:21     ` Dan Williams
  2021-08-24  6:49   ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: Gerald Schaefer @ 2021-08-23 19:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Dan Williams, nvdimm, linux-s390

On Mon, 23 Aug 2021 16:05:46 +0200
Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:

> On Fri, 20 Aug 2021 07:43:40 +0200
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > Hi all,
> > 
> > looking at the recent ZONE_DEVICE related changes we still have a
> > horrible maze of different code paths.  I already suggested to
> > depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern

Oh, we do have PTE_SPECIAL, actually that took away the last free bit
in the pte. So, if there is a chance that ZONE_DEVICE would depend
on PTE_SPECIAL instead of PTE_DEVMAP, we might be back in the game
and get rid of that CONFIG_FS_DAX_LIMITED.

Or did you rather mean depend on ARCH_HAS_PTE_SPECIAL on top of
ARCH_HAS_PTE_DEVMAP?

> > architectures have anyway.  But the other odd special case is
> > CONFIG_FS_DAX_LIMITED which is just used for the xpram driver.  Does
> > this driver still see use?  If so can we make it behave like the
> > other DAX drivers and require a pgmap?  I think the biggest missing
> > part would be to implement ARCH_HAS_PTE_DEVMAP for s390.
> 
> Puh, yes, that seems to be needed in order to enable ZONE_DEVICE, and
> then we could use devm_memremap_pages(), at least that was my plan
> some time ago. However, either the ARCH_HAS_PTE_DEVMAP dependency
> is new, or I overlooked it before, but we do not have any free bits
> in the pte left, so this is not going to work.
> 
> Would it strictly be necessary to implement ZONE_DEVICE, or would
> it be enough if we would use e.g. add_memory() instead of just
> adding the DCSS memory directly to the kernel mapping via
> vmem_add_mapping()? That way we might at least get the struct pages,
> but somehow it doesn't feel completely right.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-23 19:47   ` Gerald Schaefer
@ 2021-08-23 20:21     ` Dan Williams
  2021-08-24 14:09       ` Joao Martins
  0 siblings, 1 reply; 22+ messages in thread
From: Dan Williams @ 2021-08-23 20:21 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Christoph Hellwig, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Linux NVDIMM, linux-s390

On Mon, Aug 23, 2021 at 12:47 PM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> On Mon, 23 Aug 2021 16:05:46 +0200
> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
>
> > On Fri, 20 Aug 2021 07:43:40 +0200
> > Christoph Hellwig <hch@lst.de> wrote:
> >
> > > Hi all,
> > >
> > > looking at the recent ZONE_DEVICE related changes we still have a
> > > horrible maze of different code paths.  I already suggested to
> > > depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
>
> Oh, we do have PTE_SPECIAL, actually that took away the last free bit
> in the pte. So, if there is a chance that ZONE_DEVICE would depend
> on PTE_SPECIAL instead of PTE_DEVMAP, we might be back in the game
> and get rid of that CONFIG_FS_DAX_LIMITED.

So PTE_DEVMAP is primarily there to coordinate the
get_user_pages_fast() path, and even there it's usage can be
eliminated in favor of PTE_SPECIAL. I started that effort [1], but
need to rebase on new notify_failure infrastructure coming from Ruan
[2]. So I think you are not in the critical path until I can get the
PTE_DEVMAP requirement out of your way.

[1]: https://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com

[2]: https://lore.kernel.org/r/20210730085245.3069812-1-ruansy.fnst@fujitsu.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-23 14:05 ` Gerald Schaefer
  2021-08-23 19:47   ` Gerald Schaefer
@ 2021-08-24  6:49   ` David Hildenbrand
  1 sibling, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-08-24  6:49 UTC (permalink / raw)
  To: Gerald Schaefer, Christoph Hellwig
  Cc: Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Dan Williams, nvdimm, linux-s390

On 23.08.21 16:05, Gerald Schaefer wrote:
> On Fri, 20 Aug 2021 07:43:40 +0200
> Christoph Hellwig <hch@lst.de> wrote:
> 
>> Hi all,
>>
>> looking at the recent ZONE_DEVICE related changes we still have a
>> horrible maze of different code paths.  I already suggested to
>> depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
>> architectures have anyway.  But the other odd special case is
>> CONFIG_FS_DAX_LIMITED which is just used for the xpram driver.  Does
>> this driver still see use?  If so can we make it behave like the
>> other DAX drivers and require a pgmap?  I think the biggest missing
>> part would be to implement ARCH_HAS_PTE_DEVMAP for s390.
> 
> Puh, yes, that seems to be needed in order to enable ZONE_DEVICE, and
> then we could use devm_memremap_pages(), at least that was my plan
> some time ago. However, either the ARCH_HAS_PTE_DEVMAP dependency
> is new, or I overlooked it before, but we do not have any free bits
> in the pte left, so this is not going to work.
> 
> Would it strictly be necessary to implement ZONE_DEVICE, or would
> it be enough if we would use e.g. add_memory() instead of just
> adding the DCSS memory directly to the kernel mapping via
> vmem_add_mapping()? That way we might at least get the struct pages,
> but somehow it doesn't feel completely right.
> 

add_memory() is for adding system RAM. I don't think that's what you 
want in the case of DCSS. Supporting ZONE_DEVICE cleanly would be ideal.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-23 20:21     ` Dan Williams
@ 2021-08-24 14:09       ` Joao Martins
  2021-08-24 14:53         ` Dan Williams
  0 siblings, 1 reply; 22+ messages in thread
From: Joao Martins @ 2021-08-24 14:09 UTC (permalink / raw)
  To: Dan Williams, Gerald Schaefer
  Cc: Christoph Hellwig, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Linux NVDIMM, linux-s390



On 8/23/21 9:21 PM, Dan Williams wrote:
> On Mon, Aug 23, 2021 at 12:47 PM Gerald Schaefer
> <gerald.schaefer@linux.ibm.com> wrote:
>>
>> On Mon, 23 Aug 2021 16:05:46 +0200
>> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
>>
>>> On Fri, 20 Aug 2021 07:43:40 +0200
>>> Christoph Hellwig <hch@lst.de> wrote:
>>>
>>>> Hi all,
>>>>
>>>> looking at the recent ZONE_DEVICE related changes we still have a
>>>> horrible maze of different code paths.  I already suggested to
>>>> depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
>>
>> Oh, we do have PTE_SPECIAL, actually that took away the last free bit
>> in the pte. So, if there is a chance that ZONE_DEVICE would depend
>> on PTE_SPECIAL instead of PTE_DEVMAP, we might be back in the game
>> and get rid of that CONFIG_FS_DAX_LIMITED.
> 
> So PTE_DEVMAP is primarily there to coordinate the
> get_user_pages_fast() path, and even there it's usage can be
> eliminated in favor of PTE_SPECIAL. I started that effort [1], but
> need to rebase on new notify_failure infrastructure coming from Ruan
> [2]. So I think you are not in the critical path until I can get the
> PTE_DEVMAP requirement out of your way.
> 

Isn't the implicit case that PTE_SPECIAL means that you
aren't supposed to get a struct page back? The gup path bails out on
pte_special() case. And in the fact in this thread that you quote:

> [1]: https://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com

(...) we were speaking about[1.1] using that same special bit to block
longterm gup for fs-dax (while allowing it device-dax which does support it).

[1.1] https://lore.kernel.org/nvdimm/a8c41028-c7f5-9b93-4721-b8ddcf2427da@oracle.com/

Or maybe that's what you mean for this particular case of FS_DAX_LIMITED. Most _special*()
cases in mm match _devmap*() as far I've experimented in the past with PMD/PUD and dax
(prior to [1.1]).

I am just wondering would you differentiate the case where you have metadata for the
!FS_DAX_LIMITED case in {gup,gup_fast} path in light of removing PTE_DEVMAP. I would have
thought of checking that a pgmap exists for the pfn (without grabbing a ref to it).

> 
> [2]: https://lore.kernel.org/r/20210730085245.3069812-1-ruansy.fnst@fujitsu.com
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-20 17:42   ` Dan Williams
  2021-08-20 19:03     ` Gerald Schaefer
@ 2021-08-24 14:17     ` Joao Martins
  1 sibling, 0 replies; 22+ messages in thread
From: Joao Martins @ 2021-08-24 14:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Linux NVDIMM, linux-s390, Gerald Schaefer, Christoph Hellwig

On 8/20/21 6:42 PM, Dan Williams wrote:
> On Fri, Aug 20, 2021 at 8:41 AM Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> [ add Gerald and Joao ]
>>
>> On Thu, Aug 19, 2021 at 10:44 PM Christoph Hellwig <hch@lst.de> wrote:
>>>
>>> Hi all,
>>>
>>> looking at the recent ZONE_DEVICE related changes we still have a
>>> horrible maze of different code paths.  I already suggested to
>>> depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
>>> architectures have anyway.  But the other odd special case is
>>> CONFIG_FS_DAX_LIMITED which is just used for the xpram driver.  Does
>>> this driver still see use?  If so can we make it behave like the
>>> other DAX drivers and require a pgmap?  I think the biggest missing
>>> part would be to implement ARCH_HAS_PTE_DEVMAP for s390.
>>>
>>
>> Gerald,
>>
>> Might you still be looking to help dcssblk get out of FS_DAX_LIMITED
>> jail [1]? I recall Martin saying that 'struct page' overhead was
>> prohibitive. I don't know if Joao's 'struct page' diet patches could
>> help alleviate that at all (would require the filesystem to only
>> allocate blocks in large page sizes).

/me nods

Either that or dynamically remapping the deduplicated tail page vmemmap
areas when we punch holes in what was represented as a compound page (or
when we collapse pages back together). Not sure how crazy the latter is.

>>
>> [1]: https://lore.kernel.org/r/20180523205017.0f2bc83e@thinkpad
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-24 14:09       ` Joao Martins
@ 2021-08-24 14:53         ` Dan Williams
  2021-08-24 18:24           ` Gerald Schaefer
  0 siblings, 1 reply; 22+ messages in thread
From: Dan Williams @ 2021-08-24 14:53 UTC (permalink / raw)
  To: Joao Martins
  Cc: Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390

On Tue, Aug 24, 2021 at 7:10 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 8/23/21 9:21 PM, Dan Williams wrote:
> > On Mon, Aug 23, 2021 at 12:47 PM Gerald Schaefer
> > <gerald.schaefer@linux.ibm.com> wrote:
> >>
> >> On Mon, 23 Aug 2021 16:05:46 +0200
> >> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> >>
> >>> On Fri, 20 Aug 2021 07:43:40 +0200
> >>> Christoph Hellwig <hch@lst.de> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> looking at the recent ZONE_DEVICE related changes we still have a
> >>>> horrible maze of different code paths.  I already suggested to
> >>>> depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
> >>
> >> Oh, we do have PTE_SPECIAL, actually that took away the last free bit
> >> in the pte. So, if there is a chance that ZONE_DEVICE would depend
> >> on PTE_SPECIAL instead of PTE_DEVMAP, we might be back in the game
> >> and get rid of that CONFIG_FS_DAX_LIMITED.
> >
> > So PTE_DEVMAP is primarily there to coordinate the
> > get_user_pages_fast() path, and even there it's usage can be
> > eliminated in favor of PTE_SPECIAL. I started that effort [1], but
> > need to rebase on new notify_failure infrastructure coming from Ruan
> > [2]. So I think you are not in the critical path until I can get the
> > PTE_DEVMAP requirement out of your way.
> >
>
> Isn't the implicit case that PTE_SPECIAL means that you
> aren't supposed to get a struct page back? The gup path bails out on
> pte_special() case. And in the fact in this thread that you quote:
>
> > [1]: https://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com
>
> (...) we were speaking about[1.1] using that same special bit to block
> longterm gup for fs-dax (while allowing it device-dax which does support it).
>
> [1.1] https://lore.kernel.org/nvdimm/a8c41028-c7f5-9b93-4721-b8ddcf2427da@oracle.com/
>
> Or maybe that's what you mean for this particular case of FS_DAX_LIMITED. Most _special*()
> cases in mm match _devmap*() as far I've experimented in the past with PMD/PUD and dax
> (prior to [1.1]).
>
> I am just wondering would you differentiate the case where you have metadata for the
> !FS_DAX_LIMITED case in {gup,gup_fast} path in light of removing PTE_DEVMAP. I would have
> thought of checking that a pgmap exists for the pfn (without grabbing a ref to it).

So I should clarify, I'm not proposing removing PTE_DEVMAP, I'm
proposing relaxing its need for architectures that can not afford the
PTE bit. Those architectures would miss out on get_user_pages_fast()
for devmap pages. Then, once PTE_SPECIAL kicks get_user_pages() to the
slow path, get_dev_pagemap() is used to detect devmap pages.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-24 14:53         ` Dan Williams
@ 2021-08-24 18:24           ` Gerald Schaefer
  2021-08-24 18:44             ` Dan Williams
  0 siblings, 1 reply; 22+ messages in thread
From: Gerald Schaefer @ 2021-08-24 18:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Joao Martins, Christoph Hellwig, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Linux NVDIMM, linux-s390

On Tue, 24 Aug 2021 07:53:22 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Aug 24, 2021 at 7:10 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >
> >
> >
> > On 8/23/21 9:21 PM, Dan Williams wrote:
> > > On Mon, Aug 23, 2021 at 12:47 PM Gerald Schaefer
> > > <gerald.schaefer@linux.ibm.com> wrote:
> > >>
> > >> On Mon, 23 Aug 2021 16:05:46 +0200
> > >> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> > >>
> > >>> On Fri, 20 Aug 2021 07:43:40 +0200
> > >>> Christoph Hellwig <hch@lst.de> wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> looking at the recent ZONE_DEVICE related changes we still have a
> > >>>> horrible maze of different code paths.  I already suggested to
> > >>>> depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
> > >>
> > >> Oh, we do have PTE_SPECIAL, actually that took away the last free bit
> > >> in the pte. So, if there is a chance that ZONE_DEVICE would depend
> > >> on PTE_SPECIAL instead of PTE_DEVMAP, we might be back in the game
> > >> and get rid of that CONFIG_FS_DAX_LIMITED.
> > >
> > > So PTE_DEVMAP is primarily there to coordinate the
> > > get_user_pages_fast() path, and even there it's usage can be
> > > eliminated in favor of PTE_SPECIAL. I started that effort [1], but
> > > need to rebase on new notify_failure infrastructure coming from Ruan
> > > [2]. So I think you are not in the critical path until I can get the
> > > PTE_DEVMAP requirement out of your way.
> > >
> >
> > Isn't the implicit case that PTE_SPECIAL means that you
> > aren't supposed to get a struct page back? The gup path bails out on
> > pte_special() case. And in the fact in this thread that you quote:
> >
> > > [1]: https://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com
> >
> > (...) we were speaking about[1.1] using that same special bit to block
> > longterm gup for fs-dax (while allowing it device-dax which does support it).
> >
> > [1.1] https://lore.kernel.org/nvdimm/a8c41028-c7f5-9b93-4721-b8ddcf2427da@oracle.com/
> >
> > Or maybe that's what you mean for this particular case of FS_DAX_LIMITED. Most _special*()
> > cases in mm match _devmap*() as far I've experimented in the past with PMD/PUD and dax
> > (prior to [1.1]).
> >
> > I am just wondering would you differentiate the case where you have metadata for the
> > !FS_DAX_LIMITED case in {gup,gup_fast} path in light of removing PTE_DEVMAP. I would have
> > thought of checking that a pgmap exists for the pfn (without grabbing a ref to it).
> 
> So I should clarify, I'm not proposing removing PTE_DEVMAP, I'm
> proposing relaxing its need for architectures that can not afford the
> PTE bit. Those architectures would miss out on get_user_pages_fast()
> for devmap pages. Then, once PTE_SPECIAL kicks get_user_pages() to the
> slow path, get_dev_pagemap() is used to detect devmap pages.

Thanks, I was also a bit confused, but I think I got it now. Does that mean
that you also plan to relax the pte_devmap(pte) check in follow_page_pte(),
before calling get_dev_pagemap() in the slow path? So that it could also be
called for pte_special(), maybe with additional vma_is_dax() check. And then
rely on get_dev_pagemap() finding the pages for those "very special" PTEs that
actually would have struct pages (at least for s390 DCSS with DAX)?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-24 18:24           ` Gerald Schaefer
@ 2021-08-24 18:44             ` Dan Williams
  2021-10-14 23:04               ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Dan Williams @ 2021-08-24 18:44 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Joao Martins, Christoph Hellwig, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Linux NVDIMM, linux-s390

On Tue, Aug 24, 2021 at 11:25 AM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> On Tue, 24 Aug 2021 07:53:22 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > On Tue, Aug 24, 2021 at 7:10 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> > >
> > >
> > >
> > > On 8/23/21 9:21 PM, Dan Williams wrote:
> > > > On Mon, Aug 23, 2021 at 12:47 PM Gerald Schaefer
> > > > <gerald.schaefer@linux.ibm.com> wrote:
> > > >>
> > > >> On Mon, 23 Aug 2021 16:05:46 +0200
> > > >> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> > > >>
> > > >>> On Fri, 20 Aug 2021 07:43:40 +0200
> > > >>> Christoph Hellwig <hch@lst.de> wrote:
> > > >>>
> > > >>>> Hi all,
> > > >>>>
> > > >>>> looking at the recent ZONE_DEVICE related changes we still have a
> > > >>>> horrible maze of different code paths.  I already suggested to
> > > >>>> depend on ARCH_HAS_PTE_SPECIAL for ZONE_DEVICE there, which all modern
> > > >>
> > > >> Oh, we do have PTE_SPECIAL, actually that took away the last free bit
> > > >> in the pte. So, if there is a chance that ZONE_DEVICE would depend
> > > >> on PTE_SPECIAL instead of PTE_DEVMAP, we might be back in the game
> > > >> and get rid of that CONFIG_FS_DAX_LIMITED.
> > > >
> > > > So PTE_DEVMAP is primarily there to coordinate the
> > > > get_user_pages_fast() path, and even there it's usage can be
> > > > eliminated in favor of PTE_SPECIAL. I started that effort [1], but
> > > > need to rebase on new notify_failure infrastructure coming from Ruan
> > > > [2]. So I think you are not in the critical path until I can get the
> > > > PTE_DEVMAP requirement out of your way.
> > > >
> > >
> > > Isn't the implicit case that PTE_SPECIAL means that you
> > > aren't supposed to get a struct page back? The gup path bails out on
> > > pte_special() case. And in the fact in this thread that you quote:
> > >
> > > > [1]: https://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com
> > >
> > > (...) we were speaking about[1.1] using that same special bit to block
> > > longterm gup for fs-dax (while allowing it device-dax which does support it).
> > >
> > > [1.1] https://lore.kernel.org/nvdimm/a8c41028-c7f5-9b93-4721-b8ddcf2427da@oracle.com/
> > >
> > > Or maybe that's what you mean for this particular case of FS_DAX_LIMITED. Most _special*()
> > > cases in mm match _devmap*() as far I've experimented in the past with PMD/PUD and dax
> > > (prior to [1.1]).
> > >
> > > I am just wondering would you differentiate the case where you have metadata for the
> > > !FS_DAX_LIMITED case in {gup,gup_fast} path in light of removing PTE_DEVMAP. I would have
> > > thought of checking that a pgmap exists for the pfn (without grabbing a ref to it).
> >
> > So I should clarify, I'm not proposing removing PTE_DEVMAP, I'm
> > proposing relaxing its need for architectures that can not afford the
> > PTE bit. Those architectures would miss out on get_user_pages_fast()
> > for devmap pages. Then, once PTE_SPECIAL kicks get_user_pages() to the
> > slow path, get_dev_pagemap() is used to detect devmap pages.
>
> Thanks, I was also a bit confused, but I think I got it now. Does that mean
> that you also plan to relax the pte_devmap(pte) check in follow_page_pte(),
> before calling get_dev_pagemap() in the slow path? So that it could also be
> called for pte_special(), maybe with additional vma_is_dax() check. And then
> rely on get_dev_pagemap() finding the pages for those "very special" PTEs that
> actually would have struct pages (at least for s390 DCSS with DAX)?

Yes, that's along the lines of what I'm thinking. I.e don't expect
pte_devmap() to be there in the slow path, and use the vma to check
for DAX.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-08-24 18:44             ` Dan Williams
@ 2021-10-14 23:04               ` Jason Gunthorpe
  2021-10-15  0:22                 ` Joao Martins
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2021-10-14 23:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Gerald Schaefer, Joao Martins, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
	Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Tue, Aug 24, 2021 at 11:44:20AM -0700, Dan Williams wrote:

> Yes, that's along the lines of what I'm thinking. I.e don't expect
> pte_devmap() to be there in the slow path, and use the vma to check
> for DAX.

I think we should delete pte_devmap completely from gup.c.

It is doing a few things that are better done in more general ways:

1) Doing the get_dev_pagemap() stuff which should be entirely deleted
   from gup.c in favour of proper use of struct page references.

2) Denying FOLL_LONGTERM
   Once GUP has grabbed the page we can call is_zone_device_page() on
   the struct page. If true we can check page->pgmap and read some
   DENY_FOLL_LONGTERM flag from there

3) Different refcounts for pud/pmd pages

   Ideally DAX cases would not do this (ie Joao is fixing device-dax)
   but in the interm we can just loop over the PUD/PMD in all
   cases. Looping is safe for THP AFAIK. I described how this can work
   here:

   https://lore.kernel.org/all/20211013174140.GJ2744544@nvidia.com/

After that there are only two remaining uses:

4) The pud/pmd_devmap() in vm_normal_page() should just go
   away. ZONE_DEVICE memory with struct pages SHOULD be a normal
   page. This also means dropping pte_special too.

5) dev_pagemap_mapping_shift() - I don't know what this does
   but why not use the is_zone_device_page() approach from 2?

In this way ZONE_DEVICE pages can be fully normal pages with no
requirements on PTE flags.

Where have I gone wrong? :)

pud/pmd_devmap() looks a little more involved to remove, but I wonder
if we can change logic like this:

	if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {

Into

  if (pmd_is_page(*pmd))

? And rely on struct page based stuff as above to discern THP vs devmap?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-10-14 23:04               ` Jason Gunthorpe
@ 2021-10-15  0:22                 ` Joao Martins
  2021-10-18 23:30                   ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Joao Martins @ 2021-10-15  0:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
	Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On 10/15/21 00:04, Jason Gunthorpe wrote:
> 2) Denying FOLL_LONGTERM
>    Once GUP has grabbed the page we can call is_zone_device_page() on
>    the struct page. If true we can check page->pgmap and read some
>    DENY_FOLL_LONGTERM flag from there
> 
I had proposed something similar to that:

https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com/

Albeit I was using pgmap->type and was relying on get_dev_pagemap() ref
as opposed to after grabbing the page. I can ressurect that with some
adjustments to use pgmap flags to check DENY_LONGTERM flag (and set it
on fsdax[*]) and move the check to after try_grab_page(). That is provided
the other alternative with special page bit isn't an option anymore.

[*] which begs the question on whether fsdax is the *only* that needs the flag?

> 3) Different refcounts for pud/pmd pages
> 
>    Ideally DAX cases would not do this (ie Joao is fixing device-dax)
>    but in the interm we can just loop over the PUD/PMD in all
>    cases. Looping is safe for THP AFAIK. I described how this can work
>    here:
> 
>    https://lore.kernel.org/all/20211013174140.GJ2744544@nvidia.com/
> 
> After that there are only two remaining uses:
> 
> 4) The pud/pmd_devmap() in vm_normal_page() should just go
>    away. ZONE_DEVICE memory with struct pages SHOULD be a normal
>    page. This also means dropping pte_special too.
> 
> 5) dev_pagemap_mapping_shift() - I don't know what this does
>    but why not use the is_zone_device_page() approach from 2?
> 
dev_pagemap_mapping_shift() does a lookup to figure out
which order is the page table entry represents. is_zone_device_page()
is already used to gate usage of dev_pagemap_mapping_shift(). I think
this might be an artifact of the same issue as 3) in which PMDs/PUDs
are represented with base pages and hence you can't do what the rest
of the world does with:

	tk->size_shift = page_shift(compound_head(p));

... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-10-15  0:22                 ` Joao Martins
@ 2021-10-18 23:30                   ` Jason Gunthorpe
  2021-10-19  4:26                     ` Dan Williams
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2021-10-18 23:30 UTC (permalink / raw)
  To: Joao Martins
  Cc: Dan Williams, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
	Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:

> dev_pagemap_mapping_shift() does a lookup to figure out
> which order is the page table entry represents. is_zone_device_page()
> is already used to gate usage of dev_pagemap_mapping_shift(). I think
> this might be an artifact of the same issue as 3) in which PMDs/PUDs
> are represented with base pages and hence you can't do what the rest
> of the world does with:

This code is looks broken as written.

vma_address() relies on certain properties that I maybe DAX (maybe
even only FSDAX?) sets on its ZONE_DEVICE pages, and
dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
will crash if a memory failure hits any other kind of ZONE_DEVICE
area.

I'm not sure the comment is correct anyhow:

		/*
		 * Unmap the largest mapping to avoid breaking up
		 * device-dax mappings which are constant size. The
		 * actual size of the mapping being torn down is
		 * communicated in siginfo, see kill_proc()
		 */
		unmap_mapping_range(page->mapping, start, size, 0);

Beacuse for non PageAnon unmap_mapping_range() does either
zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().

Despite it's name __split_huge_pmd() does not actually split, it will
call __split_huge_pmd_locked:

	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
		goto out;
	__split_huge_pmd_locked(vma, pmd, range.start, freeze);

Which does
	if (!vma_is_anonymous(vma)) {
		old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);

Which is a zap, not split.

So I wonder if there is a reason to use anything other than 4k here
for DAX?

> 	tk->size_shift = page_shift(compound_head(p));
> 
> ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).

And what would be so wrong with memory failure doing this as a 4k
page?

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-10-18 23:30                   ` Jason Gunthorpe
@ 2021-10-19  4:26                     ` Dan Williams
  2021-10-19 14:20                       ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Dan Williams @ 2021-10-19  4:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
	Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
>
> > dev_pagemap_mapping_shift() does a lookup to figure out
> > which order is the page table entry represents. is_zone_device_page()
> > is already used to gate usage of dev_pagemap_mapping_shift(). I think
> > this might be an artifact of the same issue as 3) in which PMDs/PUDs
> > are represented with base pages and hence you can't do what the rest
> > of the world does with:
>
> This code is looks broken as written.
>
> vma_address() relies on certain properties that I maybe DAX (maybe
> even only FSDAX?) sets on its ZONE_DEVICE pages, and
> dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
> will crash if a memory failure hits any other kind of ZONE_DEVICE
> area.

That case is gated with a TODO in memory_failure_dev_pagemap(). I
never got any response to queries about what to do about memory
failure vs HMM.

>
> I'm not sure the comment is correct anyhow:
>
>                 /*
>                  * Unmap the largest mapping to avoid breaking up
>                  * device-dax mappings which are constant size. The
>                  * actual size of the mapping being torn down is
>                  * communicated in siginfo, see kill_proc()
>                  */
>                 unmap_mapping_range(page->mapping, start, size, 0);
>
> Beacuse for non PageAnon unmap_mapping_range() does either
> zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
>
> Despite it's name __split_huge_pmd() does not actually split, it will
> call __split_huge_pmd_locked:
>
>         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
>                 goto out;
>         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>
> Which does
>         if (!vma_is_anonymous(vma)) {
>                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>
> Which is a zap, not split.
>
> So I wonder if there is a reason to use anything other than 4k here
> for DAX?
>
> >       tk->size_shift = page_shift(compound_head(p));
> >
> > ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
>
> And what would be so wrong with memory failure doing this as a 4k
> page?

device-dax does not support misaligned mappings. It makes hard
guarantees for applications that can not afford the page table
allocation overhead of sub-1GB mappings.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-10-19  4:26                     ` Dan Williams
@ 2021-10-19 14:20                       ` Jason Gunthorpe
  2021-10-19 15:20                         ` Joao Martins
                                           ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2021-10-19 14:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
	Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
> On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
> >
> > > dev_pagemap_mapping_shift() does a lookup to figure out
> > > which order is the page table entry represents. is_zone_device_page()
> > > is already used to gate usage of dev_pagemap_mapping_shift(). I think
> > > this might be an artifact of the same issue as 3) in which PMDs/PUDs
> > > are represented with base pages and hence you can't do what the rest
> > > of the world does with:
> >
> > This code is looks broken as written.
> >
> > vma_address() relies on certain properties that I maybe DAX (maybe
> > even only FSDAX?) sets on its ZONE_DEVICE pages, and
> > dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
> > will crash if a memory failure hits any other kind of ZONE_DEVICE
> > area.
> 
> That case is gated with a TODO in memory_failure_dev_pagemap(). I
> never got any response to queries about what to do about memory
> failure vs HMM.

Unfortunately neither Logan nor Felix noticed that TODO conditional
when adding new types..

But maybe it is dead code anyhow as it already has this:

	cookie = dax_lock_page(page);
	if (!cookie)
		goto out;

Right before? Doesn't that already always fail for anything that isn't
a DAX?

> > I'm not sure the comment is correct anyhow:
> >
> >                 /*
> >                  * Unmap the largest mapping to avoid breaking up
> >                  * device-dax mappings which are constant size. The
> >                  * actual size of the mapping being torn down is
> >                  * communicated in siginfo, see kill_proc()
> >                  */
> >                 unmap_mapping_range(page->mapping, start, size, 0);
> >
> > Beacuse for non PageAnon unmap_mapping_range() does either
> > zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
> >
> > Despite it's name __split_huge_pmd() does not actually split, it will
> > call __split_huge_pmd_locked:
> >
> >         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
> >                 goto out;
> >         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
> >
> > Which does
> >         if (!vma_is_anonymous(vma)) {
> >                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> >
> > Which is a zap, not split.
> >
> > So I wonder if there is a reason to use anything other than 4k here
> > for DAX?
> >
> > >       tk->size_shift = page_shift(compound_head(p));
> > >
> > > ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
> >
> > And what would be so wrong with memory failure doing this as a 4k
> > page?
> 
> device-dax does not support misaligned mappings. It makes hard
> guarantees for applications that can not afford the page table
> allocation overhead of sub-1GB mappings.

memory-failure is the wrong layer to enforce this anyhow - if someday
unmap_mapping_range() did learn to break up the 1GB pages then we'd
want to put the condition to preserve device-dax mappings there, not
way up in memory-failure.

So we can just delete the detection of the page size and rely on the
zap code to wipe out the entire level, not split it. Which is what we
have today already.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-10-19 14:20                       ` Jason Gunthorpe
@ 2021-10-19 15:20                         ` Joao Martins
  2021-10-19 15:38                         ` Felix Kuehling
  2021-10-19 17:38                         ` Dan Williams
  2 siblings, 0 replies; 22+ messages in thread
From: Joao Martins @ 2021-10-19 15:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
	Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang,
	Dan Williams

On 10/19/21 15:20, Jason Gunthorpe wrote:
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
>> On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
>>> I'm not sure the comment is correct anyhow:
>>>
>>>                 /*
>>>                  * Unmap the largest mapping to avoid breaking up
>>>                  * device-dax mappings which are constant size. The
>>>                  * actual size of the mapping being torn down is
>>>                  * communicated in siginfo, see kill_proc()
>>>                  */
>>>                 unmap_mapping_range(page->mapping, start, size, 0);
>>>
>>> Beacuse for non PageAnon unmap_mapping_range() does either
>>> zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
>>>
>>> Despite it's name __split_huge_pmd() does not actually split, it will
>>> call __split_huge_pmd_locked:
>>>
>>>         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
>>>                 goto out;
>>>         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>>>
>>> Which does
>>>         if (!vma_is_anonymous(vma)) {
>>>                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>>>
>>> Which is a zap, not split.
>>>
>>> So I wonder if there is a reason to use anything other than 4k here
>>> for DAX?
>>>
>>>>       tk->size_shift = page_shift(compound_head(p));
>>>>
>>>> ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
>>>
>>> And what would be so wrong with memory failure doing this as a 4k
>>> page?
>>
>> device-dax does not support misaligned mappings. It makes hard
>> guarantees for applications that can not afford the page table
>> allocation overhead of sub-1GB mappings.
> 
> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
> 
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.

On a quick note, wrt to @size_shift: memory-failure reflects it back to
userspace as contextual information (::addr_lsb) of the signal, when delivering
the intended SIGBUS(code=BUS_MCEERR_*). So the size needs to be reported
somehow.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-10-19 14:20                       ` Jason Gunthorpe
  2021-10-19 15:20                         ` Joao Martins
@ 2021-10-19 15:38                         ` Felix Kuehling
  2021-10-19 17:38                         ` Dan Williams
  2 siblings, 0 replies; 22+ messages in thread
From: Felix Kuehling @ 2021-10-19 15:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Linux MM, Ralph Campbell,
	Alistair Popple, Vishal Verma, Dave Jiang, Phillips, Daniel

Am 2021-10-19 um 10:20 a.m. schrieb Jason Gunthorpe:
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
>> On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
>>>
>>>> dev_pagemap_mapping_shift() does a lookup to figure out
>>>> which order is the page table entry represents. is_zone_device_page()
>>>> is already used to gate usage of dev_pagemap_mapping_shift(). I think
>>>> this might be an artifact of the same issue as 3) in which PMDs/PUDs
>>>> are represented with base pages and hence you can't do what the rest
>>>> of the world does with:
>>> This code is looks broken as written.
>>>
>>> vma_address() relies on certain properties that I maybe DAX (maybe
>>> even only FSDAX?) sets on its ZONE_DEVICE pages, and
>>> dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
>>> will crash if a memory failure hits any other kind of ZONE_DEVICE
>>> area.
>> That case is gated with a TODO in memory_failure_dev_pagemap(). I
>> never got any response to queries about what to do about memory
>> failure vs HMM.
> Unfortunately neither Logan nor Felix noticed that TODO conditional
> when adding new types..

You mean this?

        if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
                /*
                 * TODO: Handle HMM pages which may need coordination
                 * with device-side memory.
                 */
                goto unlock;
        }

Yeah, I never looked at that. Alex, we'll need to add || pgmap->type ==
MEMORY_DEVICE_COHERENT here. Or should we change this into a test that
looks for the pgmap->types that are actually handled by
memory_failure_dev_pagemap? E.g.

        if (pgmap->type != MEMORY_DEVICE_FS_DAX)
                goto unlock;

I think in case of a real HW error, our driver should be calling
memory_failure. But then a callback from here back into the driver
wouldn't make sense.

For MADV_HWPOISON we may need a callback to the driver, if we want the
driver to treat it like an actual HW error and retire the page.


>
> But maybe it is dead code anyhow as it already has this:
>
> 	cookie = dax_lock_page(page);
> 	if (!cookie)
> 		goto out;
>
> Right before? Doesn't that already always fail for anything that isn't
> a DAX?

I guess the check for the pgmap->type should come before this.

Regards,
  Felix


>
>>> I'm not sure the comment is correct anyhow:
>>>
>>>                 /*
>>>                  * Unmap the largest mapping to avoid breaking up
>>>                  * device-dax mappings which are constant size. The
>>>                  * actual size of the mapping being torn down is
>>>                  * communicated in siginfo, see kill_proc()
>>>                  */
>>>                 unmap_mapping_range(page->mapping, start, size, 0);
>>>
>>> Beacuse for non PageAnon unmap_mapping_range() does either
>>> zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
>>>
>>> Despite it's name __split_huge_pmd() does not actually split, it will
>>> call __split_huge_pmd_locked:
>>>
>>>         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
>>>                 goto out;
>>>         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>>>
>>> Which does
>>>         if (!vma_is_anonymous(vma)) {
>>>                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>>>
>>> Which is a zap, not split.
>>>
>>> So I wonder if there is a reason to use anything other than 4k here
>>> for DAX?
>>>
>>>>       tk->size_shift = page_shift(compound_head(p));
>>>>
>>>> ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
>>> And what would be so wrong with memory failure doing this as a 4k
>>> page?
>> device-dax does not support misaligned mappings. It makes hard
>> guarantees for applications that can not afford the page table
>> allocation overhead of sub-1GB mappings.
> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
>
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.
>
> Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-10-19 14:20                       ` Jason Gunthorpe
  2021-10-19 15:20                         ` Joao Martins
  2021-10-19 15:38                         ` Felix Kuehling
@ 2021-10-19 17:38                         ` Dan Williams
  2021-10-19 17:54                           ` Jason Gunthorpe
  2 siblings, 1 reply; 22+ messages in thread
From: Dan Williams @ 2021-10-19 17:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
	Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Tue, Oct 19, 2021 at 7:25 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
> > On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
> > >
> > > > dev_pagemap_mapping_shift() does a lookup to figure out
> > > > which order is the page table entry represents. is_zone_device_page()
> > > > is already used to gate usage of dev_pagemap_mapping_shift(). I think
> > > > this might be an artifact of the same issue as 3) in which PMDs/PUDs
> > > > are represented with base pages and hence you can't do what the rest
> > > > of the world does with:
> > >
> > > This code is looks broken as written.
> > >
> > > vma_address() relies on certain properties that I maybe DAX (maybe
> > > even only FSDAX?) sets on its ZONE_DEVICE pages, and
> > > dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
> > > will crash if a memory failure hits any other kind of ZONE_DEVICE
> > > area.
> >
> > That case is gated with a TODO in memory_failure_dev_pagemap(). I
> > never got any response to queries about what to do about memory
> > failure vs HMM.
>
> Unfortunately neither Logan nor Felix noticed that TODO conditional
> when adding new types..
>
> But maybe it is dead code anyhow as it already has this:
>
>         cookie = dax_lock_page(page);
>         if (!cookie)
>                 goto out;
>
> Right before? Doesn't that already always fail for anything that isn't
> a DAX?

Yes, I originally made that ordering mistake in:

6100e34b2526 mm, memory_failure: Teach memory_failure() about dev_pagemap pages

...however, if we complete the move away from page-less DAX it also
allows for the locking to move from the xarray to lock_page(). I.e.
dax_lock_page() is pinning the inode after the fact, but I suspect the
inode should have been pinned when the mapping was established. Which
raises a question for the reflink support whether it is pinning all
involved inodes while the mapping is established?

>
> > > I'm not sure the comment is correct anyhow:
> > >
> > >                 /*
> > >                  * Unmap the largest mapping to avoid breaking up
> > >                  * device-dax mappings which are constant size. The
> > >                  * actual size of the mapping being torn down is
> > >                  * communicated in siginfo, see kill_proc()
> > >                  */
> > >                 unmap_mapping_range(page->mapping, start, size, 0);
> > >
> > > Beacuse for non PageAnon unmap_mapping_range() does either
> > > zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
> > >
> > > Despite it's name __split_huge_pmd() does not actually split, it will
> > > call __split_huge_pmd_locked:
> > >
> > >         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
> > >                 goto out;
> > >         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
> > >
> > > Which does
> > >         if (!vma_is_anonymous(vma)) {
> > >                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > >
> > > Which is a zap, not split.
> > >
> > > So I wonder if there is a reason to use anything other than 4k here
> > > for DAX?
> > >
> > > >       tk->size_shift = page_shift(compound_head(p));
> > > >
> > > > ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
> > >
> > > And what would be so wrong with memory failure doing this as a 4k
> > > page?
> >
> > device-dax does not support misaligned mappings. It makes hard
> > guarantees for applications that can not afford the page table
> > allocation overhead of sub-1GB mappings.
>
> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
>
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.

As Joao points out, userspace wants to know the blast radius of the
unmap for historical reasons. I do think it's worth deprecating that
somehow... providing a better error management interface is part of
the DAX-reflink enabling.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
  2021-10-19 17:38                         ` Dan Williams
@ 2021-10-19 17:54                           ` Jason Gunthorpe
  0 siblings, 0 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2021-10-19 17:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
	Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
	Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Tue, Oct 19, 2021 at 10:38:42AM -0700, Dan Williams wrote:

> > So we can just delete the detection of the page size and rely on the
> > zap code to wipe out the entire level, not split it. Which is what we
> > have today already.
> 
> As Joao points out, userspace wants to know the blast radius of the
> unmap for historical reasons. I do think it's worth deprecating that
> somehow... providing a better error management interface is part of
> the DAX-reflink enabling.

OK, it makes sense.

I have a less invasive idea though - emulate what zap is doing:

      if (!pud_present(*pud))
               return 0;
      if (pud_leaf(*pud))
             return PUD_SHIFT;

      if (!pmd_present(*pud))
               return 0;
      if (pmd_leaf(*pud))
             return PMD_SHIFT;
      return PAGE_SHIFT;

Which would return the "blast radius" of the unmap_mapping_range()
when it rounds up to the left page level that contains the VA.

Now it doesn't need the pte_devmap test..

And when both DAX's learn to use compound_head this can be deleted.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-10-19 17:54 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-20  5:43 can we finally kill off CONFIG_FS_DAX_LIMITED Christoph Hellwig
2021-08-20 15:41 ` Dan Williams
2021-08-20 17:42   ` Dan Williams
2021-08-20 19:03     ` Gerald Schaefer
2021-08-24 14:17     ` Joao Martins
2021-08-23 14:05 ` Gerald Schaefer
2021-08-23 19:47   ` Gerald Schaefer
2021-08-23 20:21     ` Dan Williams
2021-08-24 14:09       ` Joao Martins
2021-08-24 14:53         ` Dan Williams
2021-08-24 18:24           ` Gerald Schaefer
2021-08-24 18:44             ` Dan Williams
2021-10-14 23:04               ` Jason Gunthorpe
2021-10-15  0:22                 ` Joao Martins
2021-10-18 23:30                   ` Jason Gunthorpe
2021-10-19  4:26                     ` Dan Williams
2021-10-19 14:20                       ` Jason Gunthorpe
2021-10-19 15:20                         ` Joao Martins
2021-10-19 15:38                         ` Felix Kuehling
2021-10-19 17:38                         ` Dan Williams
2021-10-19 17:54                           ` Jason Gunthorpe
2021-08-24  6:49   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).