* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Jason Gunthorpe @ 2021-10-14 23:04 UTC
To: Dan Williams
Cc: Gerald Schaefer, Joao Martins, Christoph Hellwig, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
    Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
    Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Tue, Aug 24, 2021 at 11:44:20AM -0700, Dan Williams wrote:

> Yes, that's along the lines of what I'm thinking. I.e. don't expect
> pte_devmap() to be there in the slow path, and use the vma to check
> for DAX.

I think we should delete pte_devmap completely from gup.c. It is doing
a few things that are better done in more general ways:

1) Doing the get_dev_pagemap() stuff, which should be entirely deleted
   from gup.c in favour of proper use of struct page references.

2) Denying FOLL_LONGTERM

   Once GUP has grabbed the page we can call is_zone_device_page() on
   the struct page. If true, we can check page->pgmap and read some
   DENY_FOLL_LONGTERM flag from there.

3) Different refcounts for pud/pmd pages

   Ideally DAX cases would not do this (ie Joao is fixing device-dax),
   but in the interim we can just loop over the PUD/PMD in all cases.
   Looping is safe for THP AFAIK. I described how this can work here:

   https://lore.kernel.org/all/20211013174140.GJ2744544@nvidia.com/

After that there are only two remaining uses:

4) The pud/pmd_devmap() in vm_normal_page() should just go away.
   ZONE_DEVICE memory with struct pages SHOULD be a normal page. This
   also means dropping pte_special too.

5) dev_pagemap_mapping_shift() - I don't know what this does, but why
   not use the is_zone_device_page() approach from 2)?

In this way ZONE_DEVICE pages can be fully normal pages with no
requirements on PTE flags. Where have I gone wrong? :)

pud/pmd_devmap() looks a little more involved to remove, but I wonder
if we can change logic like this:

	if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {

Into:

	if (pmd_is_page(*pmd)) ?

And rely on struct page based stuff as above to discern THP vs devmap?

Thanks,
Jason
* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Joao Martins @ 2021-10-15 0:22 UTC
To: Jason Gunthorpe, Dan Williams
Cc: Gerald Schaefer, Christoph Hellwig, Heiko Carstens, Vasily Gorbik,
    Christian Borntraeger, Linux NVDIMM, linux-s390, Matthew Wilcox,
    Alex Sierra, Kuehling, Felix, Linux MM, Ralph Campbell,
    Alistair Popple, Vishal Verma, Dave Jiang

On 10/15/21 00:04, Jason Gunthorpe wrote:

> 2) Denying FOLL_LONGTERM
>
>    Once GUP has grabbed the page we can call is_zone_device_page() on
>    the struct page. If true, we can check page->pgmap and read some
>    DENY_FOLL_LONGTERM flag from there.

I had proposed something similar to that:

https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com/

Albeit I was using pgmap->type and was relying on the get_dev_pagemap()
ref, as opposed to checking after grabbing the page. I can resurrect
that with some adjustments to use pgmap flags to check a DENY_LONGTERM
flag (and set it on fsdax[*]) and move the check to after
try_grab_page(). That is, provided the other alternative with a special
page bit isn't an option anymore.

[*] which begs the question of whether fsdax is the *only* user that
needs the flag?

> 3) Different refcounts for pud/pmd pages
>
>    Ideally DAX cases would not do this (ie Joao is fixing device-dax),
>    but in the interim we can just loop over the PUD/PMD in all cases.
>    Looping is safe for THP AFAIK. I described how this can work here:
>
>    https://lore.kernel.org/all/20211013174140.GJ2744544@nvidia.com/
>
> After that there are only two remaining uses:
>
> 4) The pud/pmd_devmap() in vm_normal_page() should just go away.
>    ZONE_DEVICE memory with struct pages SHOULD be a normal page. This
>    also means dropping pte_special too.
>
> 5) dev_pagemap_mapping_shift() - I don't know what this does, but why
>    not use the is_zone_device_page() approach from 2)?

dev_pagemap_mapping_shift() does a lookup to figure out which order the
page table entry represents. is_zone_device_page() is already used to
gate usage of dev_pagemap_mapping_shift(). I think this might be an
artifact of the same issue as 3), in which PMDs/PUDs are represented
with base pages, and hence you can't do what the rest of the world does
with:

	tk->size_shift = page_shift(compound_head(p));

... as page_shift() would just return PAGE_SHIFT (as compound_order()
is 0).
* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Jason Gunthorpe @ 2021-10-18 23:30 UTC
To: Joao Martins
Cc: Dan Williams, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
    Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
    Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:

> dev_pagemap_mapping_shift() does a lookup to figure out which order
> the page table entry represents. is_zone_device_page() is already
> used to gate usage of dev_pagemap_mapping_shift(). I think this might
> be an artifact of the same issue as 3), in which PMDs/PUDs are
> represented with base pages, and hence you can't do what the rest of
> the world does with:

This code looks broken as written.

vma_address() relies on certain properties that maybe DAX (maybe even
only FSDAX?) sets on its ZONE_DEVICE pages, and
dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
will crash if a memory failure hits any other kind of ZONE_DEVICE
area.

I'm not sure the comment is correct anyhow:

	/*
	 * Unmap the largest mapping to avoid breaking up
	 * device-dax mappings which are constant size. The
	 * actual size of the mapping being torn down is
	 * communicated in siginfo, see kill_proc()
	 */
	unmap_mapping_range(page->mapping, start, size, 0);

Because for non-PageAnon, unmap_mapping_range() does either
zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().

Despite its name, __split_huge_pmd() does not actually split; it will
call __split_huge_pmd_locked():

	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
		goto out;
	__split_huge_pmd_locked(vma, pmd, range.start, freeze);

Which does:

	if (!vma_is_anonymous(vma)) {
		old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);

Which is a zap, not a split.

So I wonder if there is a reason to use anything other than 4k here
for DAX?

> 	tk->size_shift = page_shift(compound_head(p));
>
> ... as page_shift() would just return PAGE_SHIFT (as compound_order()
> is 0).

And what would be so wrong with memory failure doing this as a 4k
page?

Jason
* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Dan Williams @ 2021-10-19 4:26 UTC
To: Jason Gunthorpe
Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
    Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
    Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> This code looks broken as written.
>
> vma_address() relies on certain properties that maybe DAX (maybe even
> only FSDAX?) sets on its ZONE_DEVICE pages, and
> dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
> will crash if a memory failure hits any other kind of ZONE_DEVICE
> area.

That case is gated with a TODO in memory_failure_dev_pagemap(). I
never got any response to queries about what to do about memory
failure vs HMM.

> I'm not sure the comment is correct anyhow:
>
> 	/*
> 	 * Unmap the largest mapping to avoid breaking up
> 	 * device-dax mappings which are constant size. The
> 	 * actual size of the mapping being torn down is
> 	 * communicated in siginfo, see kill_proc()
> 	 */
> 	unmap_mapping_range(page->mapping, start, size, 0);
>
> Because for non-PageAnon, unmap_mapping_range() does either
> zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
>
> Despite its name, __split_huge_pmd() does not actually split; it will
> call __split_huge_pmd_locked():
>
> 	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
> 		goto out;
> 	__split_huge_pmd_locked(vma, pmd, range.start, freeze);
>
> Which does:
>
> 	if (!vma_is_anonymous(vma)) {
> 		old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>
> Which is a zap, not a split.
>
> So I wonder if there is a reason to use anything other than 4k here
> for DAX?
>
> > 	tk->size_shift = page_shift(compound_head(p));
> >
> > ... as page_shift() would just return PAGE_SHIFT (as
> > compound_order() is 0).
>
> And what would be so wrong with memory failure doing this as a 4k
> page?

device-dax does not support misaligned mappings. It makes hard
guarantees for applications that cannot afford the page table
allocation overhead of sub-1GB mappings.
* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Jason Gunthorpe @ 2021-10-19 14:20 UTC
To: Dan Williams
Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
    Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
    Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
> On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > vma_address() relies on certain properties that maybe DAX (maybe
> > even only FSDAX?) sets on its ZONE_DEVICE pages, and
> > dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
> > will crash if a memory failure hits any other kind of ZONE_DEVICE
> > area.
>
> That case is gated with a TODO in memory_failure_dev_pagemap(). I
> never got any response to queries about what to do about memory
> failure vs HMM.

Unfortunately neither Logan nor Felix noticed that TODO conditional
when adding new types..

But maybe it is dead code anyhow, as it already has this:

	cookie = dax_lock_page(page);
	if (!cookie)
		goto out;

Right before? Doesn't that already always fail for anything that isn't
DAX?

> > And what would be so wrong with memory failure doing this as a 4k
> > page?
>
> device-dax does not support misaligned mappings. It makes hard
> guarantees for applications that cannot afford the page table
> allocation overhead of sub-1GB mappings.

memory-failure is the wrong layer to enforce this anyhow - if someday
unmap_mapping_range() did learn to break up the 1GB pages then we'd
want to put the condition to preserve device-dax mappings there, not
way up in memory-failure.

So we can just delete the detection of the page size and rely on the
zap code to wipe out the entire level, not split it. Which is what we
have today already.

Jason
* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Joao Martins @ 2021-10-19 15:20 UTC
To: Jason Gunthorpe
Cc: Gerald Schaefer, Christoph Hellwig, Heiko Carstens, Vasily Gorbik,
    Christian Borntraeger, Linux NVDIMM, linux-s390, Matthew Wilcox,
    Alex Sierra, Kuehling, Felix, Linux MM, Ralph Campbell,
    Alistair Popple, Vishal Verma, Dave Jiang, Dan Williams

On 10/19/21 15:20, Jason Gunthorpe wrote:
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
>> device-dax does not support misaligned mappings. It makes hard
>> guarantees for applications that cannot afford the page table
>> allocation overhead of sub-1GB mappings.
>
> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
>
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.

On a quick note, wrt @size_shift: memory-failure reflects it back to
userspace as contextual information (::addr_lsb) of the signal, when
delivering the intended SIGBUS(code=BUS_MCEERR_*). So the size needs
to be reported somehow.
* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Felix Kuehling @ 2021-10-19 15:38 UTC
To: Jason Gunthorpe, Dan Williams
Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
    Matthew Wilcox, Alex Sierra, Linux MM, Ralph Campbell,
    Alistair Popple, Vishal Verma, Dave Jiang, Phillips, Daniel

On 2021-10-19 at 10:20 a.m., Jason Gunthorpe wrote:
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
>> That case is gated with a TODO in memory_failure_dev_pagemap(). I
>> never got any response to queries about what to do about memory
>> failure vs HMM.
>
> Unfortunately neither Logan nor Felix noticed that TODO conditional
> when adding new types..

You mean this?

	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
		/*
		 * TODO: Handle HMM pages which may need coordination
		 * with device-side memory.
		 */
		goto unlock;
	}

Yeah, I never looked at that.

Alex, we'll need to add || pgmap->type == MEMORY_DEVICE_COHERENT here.
Or should we change this into a test that looks for the pgmap->types
that are actually handled by memory_failure_dev_pagemap()? E.g.

	if (pgmap->type != MEMORY_DEVICE_FS_DAX)
		goto unlock;

I think in the case of a real HW error, our driver should be calling
memory_failure. But then a callback from here back into the driver
wouldn't make sense. For MADV_HWPOISON we may need a callback to the
driver, if we want the driver to treat it like an actual HW error and
retire the page.

> But maybe it is dead code anyhow, as it already has this:
>
> 	cookie = dax_lock_page(page);
> 	if (!cookie)
> 		goto out;
>
> Right before? Doesn't that already always fail for anything that
> isn't DAX?

I guess the check for the pgmap->type should come before this.

Regards,
  Felix
* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Dan Williams @ 2021-10-19 17:38 UTC
To: Jason Gunthorpe
Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
    Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
    Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Tue, Oct 19, 2021 at 7:25 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
> > That case is gated with a TODO in memory_failure_dev_pagemap(). I
> > never got any response to queries about what to do about memory
> > failure vs HMM.
>
> Unfortunately neither Logan nor Felix noticed that TODO conditional
> when adding new types..
>
> But maybe it is dead code anyhow, as it already has this:
>
> 	cookie = dax_lock_page(page);
> 	if (!cookie)
> 		goto out;
>
> Right before? Doesn't that already always fail for anything that
> isn't DAX?

Yes, I originally made that ordering mistake in:

    6100e34b2526 mm, memory_failure: Teach memory_failure() about
    dev_pagemap pages

...however, if we complete the move away from page-less DAX, it also
allows the locking to move from the xarray to lock_page(). I.e.
dax_lock_page() is pinning the inode after the fact, but I suspect the
inode should have been pinned when the mapping was established. Which
raises a question for the reflink support: is it pinning all involved
inodes while the mapping is established?

> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
>
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.

As Joao points out, userspace wants to know the blast radius of the
unmap for historical reasons. I do think it's worth deprecating that
somehow... providing a better error management interface is part of
the DAX-reflink enabling.
* Re: can we finally kill off CONFIG_FS_DAX_LIMITED
From: Jason Gunthorpe @ 2021-10-19 17:54 UTC
To: Dan Williams
Cc: Joao Martins, Gerald Schaefer, Christoph Hellwig, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Linux NVDIMM, linux-s390,
    Matthew Wilcox, Alex Sierra, Kuehling, Felix, Linux MM,
    Ralph Campbell, Alistair Popple, Vishal Verma, Dave Jiang

On Tue, Oct 19, 2021 at 10:38:42AM -0700, Dan Williams wrote:

> > So we can just delete the detection of the page size and rely on
> > the zap code to wipe out the entire level, not split it. Which is
> > what we have today already.
>
> As Joao points out, userspace wants to know the blast radius of the
> unmap for historical reasons. I do think it's worth deprecating that
> somehow... providing a better error management interface is part of
> the DAX-reflink enabling.

OK, it makes sense. I have a less invasive idea though - emulate what
zap is doing:

	if (!pud_present(*pud))
		return 0;
	if (pud_leaf(*pud))
		return PUD_SHIFT;
	if (!pmd_present(*pmd))
		return 0;
	if (pmd_leaf(*pmd))
		return PMD_SHIFT;
	return PAGE_SHIFT;

Which would return the "blast radius" of the unmap_mapping_range()
when it rounds up to the page table level that contains the VA. Now it
doesn't need the pte_devmap test..

And when both DAXes learn to use compound_head this can be deleted.

Jason