* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount [not found] ` <20211014170634.GV2744544@nvidia.com> @ 2021-10-14 18:43 ` Matthew Wilcox 2021-10-14 19:01 ` Dan Williams 0 siblings, 1 reply; 17+ messages in thread From: Matthew Wilcox @ 2021-10-14 18:43 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alex Sierra, akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs, amd-gfx, dri-devel, hch, jglisse, apopple, Dan Williams, Vishal Verma, Dave Jiang, nvdimm It would probably help if you cc'd Dan on this. As far as I know he's the only person left who cares about GUP on DAX. On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote: > On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote: > > From: Ralph Campbell <rcampbell@nvidia.com> > > > > ZONE_DEVICE struct pages have an extra reference count that complicates the > > code for put_page() and several places in the kernel that need to check the > > reference count to see that a page is not being used (gup, compaction, > > migration, etc.). Clean up the code so the reference count doesn't need to > > be treated specially for ZONE_DEVICE. > > > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> > > Signed-off-by: Alex Sierra <alex.sierra@amd.com> > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > --- > > v2: > > AS: merged this patch in linux 5.11 version > > > > v5: > > AS: add condition at try_grab_page to check for the zone device type, while > > page ref counter is checked less/equal to zero. In case of device zone, pages > > ref counter are initialized to zero. > > > > v7: > > AS: fix condition at try_grab_page added at v5, is invalid. It supposed > > to fix xfstests/generic/413 test, however, there's a known issue on > > this test where DAX mapped area DIO to non-DAX expect to fail. > > https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xzhou@redhat.com > > This condition was removed after rebase over patch series > > https://lore.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com > > --- > > arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- > > drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- > > fs/dax.c | 4 +- > > include/linux/dax.h | 2 +- > > include/linux/memremap.h | 7 +-- > > include/linux/mm.h | 11 ---- > > lib/test_hmm.c | 2 +- > > mm/internal.h | 8 +++ > > mm/memcontrol.c | 6 +-- > > mm/memremap.c | 69 +++++++------------------- > > mm/migrate.c | 5 -- > > mm/page_alloc.c | 3 ++ > > mm/swap.c | 45 ++--------------- > > 13 files changed, 46 insertions(+), 120 deletions(-) > > Has anyone tested this with FSDAX? Does get_user_pages() on fsdax > backed memory still work? > > What refcount value does the struct pages have when they are installed > in the PTEs? Remember a 0 refcount will make all the get_user_pages() > fail. > > I'm looking at the call path starting in ext4_punch_hole() and I would > expect to see something manipulating the page ref count before > the ext4_break_layouts() call path gets to the dax_page_unused() test. > > All I see is we go into unmap_mapping_pages() - that would normally > put back the page references held by PTEs but insert_pfn() has this: > > if (pfn_t_devmap(pfn)) > entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); > > And: > > static inline pte_t pte_mkdevmap(pte_t pte) > { > return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); > } > > Which interacts with vm_normal_page(): > > if (pte_devmap(pte)) > return NULL; > > To disable that refcounting? > > So... I have a feeling this will have PTEs pointing to 0 refcount > pages? Unless FSDAX is !pte_devmap which is not the case, right? > > This seems further confirmed by this comment: > > /* > * If we race get_user_pages_fast() here either we'll see the > * elevated page count in the iteration and wait, or > * get_user_pages_fast() will see that the page it took a reference > * against is no longer mapped in the page tables and bail to the > * get_user_pages() slow path. The slow path is protected by > * pte_lock() and pmd_lock(). New references are not taken without > * holding those locks, and unmap_mapping_pages() will not zero the > * pte or pmd without holding the respective lock, so we are > * guaranteed to either see new references or prevent new > * references from being established. > */ > > Which seems to explain this scheme relies on unmap_mapping_pages() to > fence GUP_fast, not on GUP_fast observing 0 refcounts when it should > stop. > > This seems like it would be properly fixed by using normal page > refcounting for PTEs - ie stop using special for these pages? > > Does anyone know why devmap is pte_special anyhow? > > > +void free_zone_device_page(struct page *page) > > +{ > > + switch (page->pgmap->type) { > > + case MEMORY_DEVICE_PRIVATE: > > + free_device_page(page); > > + return; > > + case MEMORY_DEVICE_FS_DAX: > > + /* notify page idle */ > > + wake_up_var(&page->_refcount); > > + return; > > It is not for this series, but I wonder if we should just always call > ops->page_free and have free_device_page() logic in that callback for > the non-fs-dax cases? > > For instance where is the mem_cgroup_charge() call to pair with the > mem_cgroup_uncharge() in free_device_page()? > > Isn't cgroup charging (or not) the responsibility of the "allocator" > eg the pgmap_ops owner? > > Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-14 18:43 ` [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount Matthew Wilcox @ 2021-10-14 19:01 ` Dan Williams 2021-10-14 23:06 ` Jason Gunthorpe 0 siblings, 1 reply; 17+ messages in thread From: Dan Williams @ 2021-10-14 19:01 UTC (permalink / raw) To: Matthew Wilcox Cc: Jason Gunthorpe, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM On Thu, Oct 14, 2021 at 11:45 AM Matthew Wilcox <willy@infradead.org> wrote: > > > It would probably help if you cc'd Dan on this. Thanks. [..] > > On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote: > > On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote: > > > From: Ralph Campbell <rcampbell@nvidia.com> > > > > > > ZONE_DEVICE struct pages have an extra reference count that complicates the > > > code for put_page() and several places in the kernel that need to check the > > > reference count to see that a page is not being used (gup, compaction, > > > migration, etc.). Clean up the code so the reference count doesn't need to > > > be treated specially for ZONE_DEVICE. > > > > > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> > > > Signed-off-by: Alex Sierra <alex.sierra@amd.com> > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > --- > > > v2: > > > AS: merged this patch in linux 5.11 version > > > > > > v5: > > > AS: add condition at try_grab_page to check for the zone device type, while > > > page ref counter is checked less/equal to zero. In case of device zone, pages > > > ref counter are initialized to zero. > > > > > > v7: > > > AS: fix condition at try_grab_page added at v5, is invalid. It supposed > > > to fix xfstests/generic/413 test, however, there's a known issue on > > > this test where DAX mapped area DIO to non-DAX expect to fail. > > > https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xzhou@redhat.com > > > This condition was removed after rebase over patch series > > > https://lore.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com > > > --- > > > arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- > > > drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- > > > fs/dax.c | 4 +- > > > include/linux/dax.h | 2 +- > > > include/linux/memremap.h | 7 +-- > > > include/linux/mm.h | 11 ---- > > > lib/test_hmm.c | 2 +- > > > mm/internal.h | 8 +++ > > > mm/memcontrol.c | 6 +-- > > > mm/memremap.c | 69 +++++++------------------- > > > mm/migrate.c | 5 -- > > > mm/page_alloc.c | 3 ++ > > > mm/swap.c | 45 ++--------------- > > > 13 files changed, 46 insertions(+), 120 deletions(-) > > > > Has anyone tested this with FSDAX? Does get_user_pages() on fsdax > > backed memory still work? > > > > What refcount value does the struct pages have when they are installed > > in the PTEs? Remember a 0 refcount will make all the get_user_pages() > > fail. > > > > I'm looking at the call path starting in ext4_punch_hole() and I would > > expect to see something manipulating the page ref count before > > the ext4_break_layouts() call path gets to the dax_page_unused() test. > > > > All I see is we go into unmap_mapping_pages() - that would normally > > put back the page references held by PTEs but insert_pfn() has this: > > > > if (pfn_t_devmap(pfn)) > > entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); > > > > And: > > > > static inline pte_t pte_mkdevmap(pte_t pte) > > { > > return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); > > } > > > > Which interacts with vm_normal_page(): > > > > if (pte_devmap(pte)) > > return NULL; > > > > To disable that refcounting? > > > > So... I have a feeling this will have PTEs pointing to 0 refcount > > pages? Unless FSDAX is !pte_devmap which is not the case, right? > > > > This seems further confirmed by this comment: > > > > /* > > * If we race get_user_pages_fast() here either we'll see the > > * elevated page count in the iteration and wait, or > > * get_user_pages_fast() will see that the page it took a reference > > * against is no longer mapped in the page tables and bail to the > > * get_user_pages() slow path. The slow path is protected by > > * pte_lock() and pmd_lock(). New references are not taken without > > * holding those locks, and unmap_mapping_pages() will not zero the > > * pte or pmd without holding the respective lock, so we are > > * guaranteed to either see new references or prevent new > > * references from being established. > > */ > > > > Which seems to explain this scheme relies on unmap_mapping_pages() to > > fence GUP_fast, not on GUP_fast observing 0 refcounts when it should > > stop. > > > > This seems like it would be properly fixed by using normal page > > refcounting for PTEs - ie stop using special for these pages? > > > > Does anyone know why devmap is pte_special anyhow? It does not need to be special as mentioned here: https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/ The refcount dependencies also go away after this... https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/ ...but you can see that patches 1 and 2 in that series depend on being able to guarantee that all mappings are invalidated when the undelying device that owns the pgmap goes away. For that to happen there needs to be communication back to the FS for device-gone / failure events. That work is in progress via this series: https://lore.kernel.org/all/20210924130959.2695749-1-ruansy.fnst@fujitsu.com/ So there's a path to unwind this awkwardness, but it needs some dominoes to fall first as far as I can see. My current focus is getting Shiyang's series unblocked. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-14 19:01 ` Dan Williams @ 2021-10-14 23:06 ` Jason Gunthorpe 2021-10-15 1:37 ` Dan Williams 0 siblings, 1 reply; 17+ messages in thread From: Jason Gunthorpe @ 2021-10-14 23:06 UTC (permalink / raw) To: Dan Williams Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote: > > > Does anyone know why devmap is pte_special anyhow? > > It does not need to be special as mentioned here: > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/ I added a remark there Not special means more to me, it means devmap should do the refcounts properly like normal memory pages. It means vm_normal_page should return !NULL and it means insert_page, not insert_pfn should be used to install them in the PTE. VMAs should not be MIXED MAP, but normal struct page maps. I think this change alone would fix all the refcount problems everwhere in DAX and devmap. > The refcount dependencies also go away after this... > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/ > > ...but you can see that patches 1 and 2 in that series depend on being > able to guarantee that all mappings are invalidated when the undelying > device that owns the pgmap goes away. If I have put everything together right this is because of what I pointed to here. FS-DAX is installing 0 refcount pages into PTEs and expecting that to work sanely. This means the page map cannot be removed until all the PTEs are fully flushed, which buggily doesn't happen because of the missing unplug. However, this is all because nobody incrd a refcount to represent the reference in the PTE and since this ment that 0 refcount pages were wrongly stuffed into PTEs then devmap used the refcount == 1 hack to unbreak GUP? So.. Is there some reason why devmap pages are trying so hard to avoid sane refcounting??? If the PTE itself holds the refcount (by not being special) then there is no need for the pagemap stuff in GUP. pagemap already waits for refs to go to 0 so the missing shootdown during nvdimm unplug will cause pagemap to block until the address spaces are invalidated. IMHO this is already better than the current buggy situation of allowing continued PTE reference to memory that is now removed from the system. > For that to happen there needs to be communication back to the FS for > device-gone / failure events. That work is in progress via this > series: > > https://lore.kernel.org/all/20210924130959.2695749-1-ruansy.fnst@fujitsu.com/ This is fine, but I don't think it should block fixing the mm side - the end result here still cannot be 0 ref count pages installed in PTEs. Fixing that does not depend on shootdown during device removal, right? It requires holding refcounts while pages are installed into address spaces - and this lack is a direct cause of making the PTEs all special and using insert_pfn and MIXED_MAP. Thanks, Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-14 23:06 ` Jason Gunthorpe @ 2021-10-15 1:37 ` Dan Williams 2021-10-16 15:44 ` Jason Gunthorpe 0 siblings, 1 reply; 17+ messages in thread From: Dan Williams @ 2021-10-15 1:37 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote: > > > > Does anyone know why devmap is pte_special anyhow? > > > > It does not need to be special as mentioned here: > > > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/ > > I added a remark there > > Not special means more to me, it means devmap should do the refcounts > properly like normal memory pages. > > It means vm_normal_page should return !NULL and it means insert_page, > not insert_pfn should be used to install them in the PTE. VMAs should > not be MIXED MAP, but normal struct page maps. > > I think this change alone would fix all the refcount problems > everwhere in DAX and devmap. > > > The refcount dependencies also go away after this... > > > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/ > > > > ...but you can see that patches 1 and 2 in that series depend on being > > able to guarantee that all mappings are invalidated when the undelying > > device that owns the pgmap goes away. > > If I have put everything together right this is because of what I > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and > expecting that to work sanely. > > This means the page map cannot be removed until all the PTEs are fully > flushed, which buggily doesn't happen because of the missing unplug. > > However, this is all because nobody incrd a refcount to represent the > reference in the PTE and since this ment that 0 refcount pages were > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to > unbreak GUP? > > So.. Is there some reason why devmap pages are trying so hard to avoid > sane refcounting??? I wouldn't put it that way. It's more that the original sin of ZONE_DEVICE that sought to reuse the lru field space, by never having a zero recount, then got layered upon and calcified in malignant ways. In the meantime surrounding infrastructure got decrustified. Work like the 'struct page' cleanup among other things, made it clearer and clearer over time that the original design choice needed to be fixed. > If the PTE itself holds the refcount (by not being special) then there > is no need for the pagemap stuff in GUP. pagemap already waits for > refs to go to 0 so the missing shootdown during nvdimm unplug will > cause pagemap to block until the address spaces are invalidated. IMHO > this is already better than the current buggy situation of allowing > continued PTE reference to memory that is now removed from the system. > > > For that to happen there needs to be communication back to the FS for > > device-gone / failure events. That work is in progress via this > > series: > > > > https://lore.kernel.org/all/20210924130959.2695749-1-ruansy.fnst@fujitsu.com/ > > This is fine, but I don't think it should block fixing the mm side - > the end result here still cannot be 0 ref count pages installed in > PTEs. > > Fixing that does not depend on shootdown during device removal, right? > > It requires holding refcounts while pages are installed into address > spaces - and this lack is a direct cause of making the PTEs all > special and using insert_pfn and MIXED_MAP. The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but now that we have page-available DAX, yes, we can skip the FS notification and just rely on typical refcounting and hanging until the FS has a chance to uninstall the PTEs. You're right, the FS notification is an improvement to the conversion, not a requirement. However, there still needs to be something in the gup-fast path to indicate that GUP_LONGTERM is not possible because the PTE represents a pfn that can not support typical page-cache behavior for truncate which is to just disconnect the page from the file and keep the page pinned indefinitely. I think the "no longterm" caveat would be the only remaining utility of PTE_DEVMAP after the above conversion to use typical page refcounts throughout DAX. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-15 1:37 ` Dan Williams @ 2021-10-16 15:44 ` Jason Gunthorpe 2021-10-16 16:39 ` Matthew Wilcox 2021-10-17 18:35 ` Dan Williams 0 siblings, 2 replies; 17+ messages in thread From: Jason Gunthorpe @ 2021-10-16 15:44 UTC (permalink / raw) To: Dan Williams Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM On Thu, Oct 14, 2021 at 06:37:35PM -0700, Dan Williams wrote: > On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote: > > > > > Does anyone know why devmap is pte_special anyhow? > > > > > > It does not need to be special as mentioned here: > > > > > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/ > > > > I added a remark there > > > > Not special means more to me, it means devmap should do the refcounts > > properly like normal memory pages. > > > > It means vm_normal_page should return !NULL and it means insert_page, > > not insert_pfn should be used to install them in the PTE. VMAs should > > not be MIXED MAP, but normal struct page maps. > > > > I think this change alone would fix all the refcount problems > > everwhere in DAX and devmap. > > > > > The refcount dependencies also go away after this... > > > > > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/ > > > > > > ...but you can see that patches 1 and 2 in that series depend on being > > > able to guarantee that all mappings are invalidated when the undelying > > > device that owns the pgmap goes away. > > > > If I have put everything together right this is because of what I > > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and > > expecting that to work sanely. > > > > This means the page map cannot be removed until all the PTEs are fully > > flushed, which buggily doesn't happen because of the missing unplug. > > > > However, this is all because nobody incrd a refcount to represent the > > reference in the PTE and since this ment that 0 refcount pages were > > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to > > unbreak GUP? > > > > So.. Is there some reason why devmap pages are trying so hard to avoid > > sane refcounting??? > > I wouldn't put it that way. It's more that the original sin of > ZONE_DEVICE that sought to reuse the lru field space, by never having > a zero recount, then got layered upon and calcified in malignant ways. > In the meantime surrounding infrastructure got decrustified. Work like > the 'struct page' cleanup among other things, made it clearer and > clearer over time that the original design choice needed to be fixed. So, there used to be some reason, but with the current code arrangement it is not the case? This is why it looks so strange when reading it.. AFIACT we are good on the LRU stuff now? Eg release_pages() does not use page->lru for is_zone_device_page()? > The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but > now that we have page-available DAX, yes, we can skip the FS > notification and just rely on typical refcounting and hanging until > the FS has a chance to uninstall the PTEs. You're right, the FS > notification is an improvement to the conversion, not a requirement. It all sounds so simple. I looked at this for a good long time and the first major roadblock is huge pages. The mm side is designed around THP which puts a single high order page into the PUD/PMD such that the mm only has to juggle one page. This a very sane and reasonable thing to do. DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with THP would make using normal refconting much simpler. I looked at teaching the mm core to deal with page arrays - it is certainly doable, but it is quite inefficient and ugly mm code. So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find? Joao has a series that does this to device-dax: https://lore.kernel.org/all/20210827145819.16471-1-joao.m.martins@oracle.com/ TTM is kind of broken already but does have a struct page, and it is probably already a high order one. Maybe it is OK? I will ask Thomas That leaves FSDAX. Can this be fixed? I know nothing of filesystems or fsdax to guess. Sounds like folios to me .. Assuming changing FSDAX is hard.. How would DAX people feel about just deleting the PUD/PMD support until it can be done with compound pages? Doing so would allow fixing the lifecycle, cleaning up gup and basically delete a huge wack of slow DAX and devmap specific code from the mm. It also opens the door to removing the PTE flag and thus allowing DAX on more architectures. > However, there still needs to be something in the gup-fast path to > indicate that GUP_LONGTERM is not possible because the PTE > represents It looks easy now: 1) We know the pfn has a struct page * because it isn't pfn special 2) We can get a valid ref on the struct page *: head = try_grab_compound_head(page, 1, flags); Holding a ref ensures that head->pgmap is valid. 3) Then check the page type directly: if ((flags & FOLL_LONGTERM) && is_zone_device_page(head)) This tells us we can access the ZONE_DEVICE struct in the union 4) Ask what the pgmap owner wants to do: if (head->pgmap->deny_foll_longterm) return FAIL Cost is only paied if FOLL_LONGTERM is given Thanks, Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-16 15:44 ` Jason Gunthorpe @ 2021-10-16 16:39 ` Matthew Wilcox 2021-10-17 18:20 ` Dan Williams 2021-10-17 18:35 ` Dan Williams 1 sibling, 1 reply; 17+ messages in thread From: Matthew Wilcox @ 2021-10-16 16:39 UTC (permalink / raw) To: Jason Gunthorpe Cc: Dan Williams, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote: > Assuming changing FSDAX is hard.. How would DAX people feel about just > deleting the PUD/PMD support until it can be done with compound pages? I think there are customers who would find that an unacceptable answer :-) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-16 16:39 ` Matthew Wilcox @ 2021-10-17 18:20 ` Dan Williams 0 siblings, 0 replies; 17+ messages in thread From: Dan Williams @ 2021-10-17 18:20 UTC (permalink / raw) To: Matthew Wilcox Cc: Jason Gunthorpe, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM On Sat, Oct 16, 2021 at 9:39 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote: > > Assuming changing FSDAX is hard.. How would DAX people feel about just > > deleting the PUD/PMD support until it can be done with compound pages? > > I think there are customers who would find that an unacceptable answer :-) No, not given the number of end users that ask for help debugging PMD support. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-16 15:44 ` Jason Gunthorpe 2021-10-16 16:39 ` Matthew Wilcox @ 2021-10-17 18:35 ` Dan Williams 2021-10-18 18:25 ` Jason Gunthorpe 1 sibling, 1 reply; 17+ messages in thread From: Dan Williams @ 2021-10-17 18:35 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand On Sat, Oct 16, 2021 at 8:45 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Thu, Oct 14, 2021 at 06:37:35PM -0700, Dan Williams wrote: > > On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote: > > > > > > Does anyone know why devmap is pte_special anyhow? > > > > > > > > It does not need to be special as mentioned here: > > > > > > > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/ > > > > > > I added a remark there > > > > > > Not special means more to me, it means devmap should do the refcounts > > > properly like normal memory pages. > > > > > > It means vm_normal_page should return !NULL and it means insert_page, > > > not insert_pfn should be used to install them in the PTE. VMAs should > > > not be MIXED MAP, but normal struct page maps. > > > > > > I think this change alone would fix all the refcount problems > > > everwhere in DAX and devmap. > > > > > > > The refcount dependencies also go away after this... > > > > > > > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/ > > > > > > > > ...but you can see that patches 1 and 2 in that series depend on being > > > > able to guarantee that all mappings are invalidated when the undelying > > > > device that owns the pgmap goes away. > > > > > > If I have put everything together right this is because of what I > > > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and > > > expecting that to work sanely. > > > > > > This means the page map cannot be removed until all the PTEs are fully > > > flushed, which buggily doesn't happen because of the missing unplug. > > > > > > However, this is all because nobody incrd a refcount to represent the > > > reference in the PTE and since this ment that 0 refcount pages were > > > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to > > > unbreak GUP? > > > > > > So.. Is there some reason why devmap pages are trying so hard to avoid > > > sane refcounting??? > > > > I wouldn't put it that way. It's more that the original sin of > > ZONE_DEVICE that sought to reuse the lru field space, by never having > > a zero recount, then got layered upon and calcified in malignant ways. > > In the meantime surrounding infrastructure got decrustified. Work like > > the 'struct page' cleanup among other things, made it clearer and > > clearer over time that the original design choice needed to be fixed. > > So, there used to be some reason, but with the current code > arrangement it is not the case? This is why it looks so strange when > reading it.. > > AFIACT we are good on the LRU stuff now? Eg release_pages() does not > use page->lru for is_zone_device_page()? Yes, the lru collision was fixed by the 'struct page' cleanup. > > > The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but > > now that we have page-available DAX, yes, we can skip the FS > > notification and just rely on typical refcounting and hanging until > > the FS has a chance to uninstall the PTEs. You're right, the FS > > notification is an improvement to the conversion, not a requirement. > > It all sounds so simple. I looked at this for a good long time and the > first major roadblock is huge pages. > > The mm side is designed around THP which puts a single high order page > into the PUD/PMD such that the mm only has to juggle one page. This a > very sane and reasonable thing to do. > > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with > THP would make using normal refconting much simpler. I looked at > teaching the mm core to deal with page arrays - it is certainly > doable, but it is quite inefficient and ugly mm code. THP does not support PUD, and neither does FSDAX, so it's only PMDs we need to worry about. > > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find? > > Joao has a series that does this to device-dax: > > https://lore.kernel.org/all/20210827145819.16471-1-joao.m.martins@oracle.com/ That assumes there's never any need to fracture a huge page which FSDAX could not support unless the filesystem was built with 2MB block size. > TTM is kind of broken already but does have a struct page, and it is > probably already a high order one. Maybe it is OK? I will ask Thomas > > That leaves FSDAX. Can this be fixed? I know nothing of filesystems or > fsdax to guess. Sounds like folios to me .. My thought here is to assemble a compound page on the fly when establishing the FSDAX PMD mapping. > Assuming changing FSDAX is hard.. How would DAX people feel about just > deleting the PUD/PMD support until it can be done with compound pages? There are end users that would notice the PMD regression, and I think FSDAX PMDs with proper compound page metadata is on the same order of work as fixing the refcount. > Doing so would allow fixing the lifecycle, cleaning up gup and > basically delete a huge wack of slow DAX and devmap specific code from > the mm. It also opens the door to removing the PTE flag and thus > allowing DAX on more architectures. > > > However, there still needs to be something in the gup-fast path to > > indicate that GUP_LONGTERM is not possible because the PTE > > represents > > It looks easy now: > > 1) We know the pfn has a struct page * because it isn't pfn special > > 2) We can get a valid ref on the struct page *: > > head = try_grab_compound_head(page, 1, flags); > > Holding a ref ensures that head->pgmap is valid. > > 3) Then check the page type directly: > > if ((flags & FOLL_LONGTERM) && is_zone_device_page(head)) > > This tells us we can access the ZONE_DEVICE struct in the union > > 4) Ask what the pgmap owner wants to do: > > if (head->pgmap->deny_foll_longterm) > return FAIL The pgmap itself does not know, but the "holder" could specify this policy. Which is in line with the 'dax_holder_ops' concept being introduced for reverse mapping support. I.e. when the FS claims the dax-device it can specify at that point that it wants to forbid longterm. > Cost is only paied if FOLL_LONGTERM is given Yeah, that does naturally fall out from no longer needing to take an explicit dev_pagemap reference and assuming a page is always there. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-17 18:35 ` Dan Williams @ 2021-10-18 18:25 ` Jason Gunthorpe 2021-10-18 19:37 ` Dan Williams 0 siblings, 1 reply; 17+ messages in thread From: Jason Gunthorpe @ 2021-10-18 18:25 UTC (permalink / raw) To: Dan Williams Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand, Joao Martins On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams wrote: > > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with > > THP would make using normal refconting much simpler. I looked at > > teaching the mm core to deal with page arrays - it is certainly > > doable, but it is quite inefficient and ugly mm code. > > THP does not support PUD, and neither does FSDAX, so it's only PMDs we > need to worry about. device-dax uses PUD, along with TTM, they are the only places. I'm not sure TTM is a real place though. > > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find? > > > > Joao has a series that does this to device-dax: > > > > https://lore.kernel.org/all/20210827145819.16471-1-joao.m.martins@oracle.com/ > > That assumes there's never any need to fracture a huge page which > FSDAX could not support unless the filesystem was built with 2MB block > size. As I understand things, something like FSDAX post-folio should generate maximal compound pages for extents in the page cache that are physically contiguous. A high order folio can be placed in any lower order in the page tables, so we never have to fracture it, unless the underlying page are moved around - which requires an unmap_mapping_range() cycle.. > > Assuming changing FSDAX is hard.. How would DAX people feel about just > > deleting the PUD/PMD support until it can be done with compound pages? > > There are end users that would notice the PMD regression, and I think > FSDAX PMDs with proper compound page metadata is on the same order of > work as fixing the refcount. Hmm, I don't know.. I sketched out the refcount stuff and the code is OK but ugly and will add a conditional to some THP cases On the other hand, making THP unmap cases a bit slower is probably a net win compared to making put_page a bit slower.. Considering unmap is already quite heavy. > > 4) Ask what the pgmap owner wants to do: > > > > if (head->pgmap->deny_foll_longterm) > > return FAIL > > The pgmap itself does not know, but the "holder" could specify this > policy. Here I imagine the thing that creates the pgmap would specify the policy it wants. In most cases the policy is tightly coupled to what the free function in the the provided dev_pagemap_ops does.. > Which is in line with the 'dax_holder_ops' concept being introduced > for reverse mapping support. I.e. when the FS claims the dax-device > it can specify at that point that it wants to forbid longterm. Which is a reasonable refinment if we think there are cases where two nvdim users would want different things. Anyhow, I'm wondering on a way forward. There are many balls in the air, all linked: - Joao's compound page support for device_dax and more - Alex's DEVICE_COHERENT - The refcount normalization - Removing the pgmap test from GUP - Removing the need for the PUD/PMD/PTE special bit - Removing the need for the PUD/PMD/PTE devmap bit - Remove PUD/PMD vma_is_special - folios for fsdax - shootdown for fsdax Frankly I'm leery to see more ZONE_DEVICE users crop up that depend on the current semantics as that will only make it even harder to fix.. I think it would be good to see Joao's compound page support move ahead.. So.. Does anyone want to work on finishing this patch series?? I can give some guidance on how I think it should work at least Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-18 18:25 ` Jason Gunthorpe @ 2021-10-18 19:37 ` Dan Williams 2021-10-18 23:06 ` Jason Gunthorpe 0 siblings, 1 reply; 17+ messages in thread From: Dan Williams @ 2021-10-18 19:37 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand, Joao Martins On Mon, Oct 18, 2021 at 11:26 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams wrote: > > > > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with > > > THP would make using normal refconting much simpler. I looked at > > > teaching the mm core to deal with page arrays - it is certainly > > > doable, but it is quite inefficient and ugly mm code. > > > > THP does not support PUD, and neither does FSDAX, so it's only PMDs we > > need to worry about. > > device-dax uses PUD, along with TTM, they are the only places. I'm not > sure TTM is a real place though. I was setting device-dax aside because it can use Joao's changes to get compound-page support. > > > > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find? > > > > > > Joao has a series that does this to device-dax: > > > > > > https://lore.kernel.org/all/20210827145819.16471-1-joao.m.martins@oracle.com/ > > > > That assumes there's never any need to fracture a huge page which > > FSDAX could not support unless the filesystem was built with 2MB block > > size. > > As I understand things, something like FSDAX post-folio should > generate maximal compound pages for extents in the page cache that are > physically contiguous. > > A high order folio can be placed in any lower order in the page > tables, so we never have to fracture it, unless the underlying page > are moved around - which requires an unmap_mapping_range() cycle.. That would be useful to disconnect the compound-page size from the page-table-entry installed for the page. However, don't we need typical compound page fracturing in the near term until folios move ahead? > > > > Assuming changing FSDAX is hard.. How would DAX people feel about just > > > deleting the PUD/PMD support until it can be done with compound pages? > > > > There are end users that would notice the PMD regression, and I think > > FSDAX PMDs with proper compound page metadata is on the same order of > > work as fixing the refcount. > > Hmm, I don't know.. I sketched out the refcount stuff and the code is > OK but ugly and will add a conditional to some THP cases That reminds me that there are several places that do: pmd_devmap(pmd) || pmd_trans_huge(pmd) ...for the common cases where a THP and DEVMAP page are equivalent, but there are a few places where those paths are not shared when the THP path expects that the page came from the page allocator. So while DEVMAP is not needed in GUP after this conversion, there still needs to be an audit of when THP needs to be careful of DAX mappings. > On the other hand, making THP unmap cases a bit slower is probably a > net win compared to making put_page a bit slower.. Considering unmap > is already quite heavy. FSDAX eventually learned how to replace 'struct page' with xarray for several paths, so "how hard could it be" (/famous last words) to replace xarray with 'struct page'? I think the end result will be cleaner, but yes, I expect some dragons in the conversion. > > > > 4) Ask what the pgmap owner wants to do: > > > > > > if (head->pgmap->deny_foll_longterm) > > > return FAIL > > > > The pgmap itself does not know, but the "holder" could specify this > > policy. > > Here I imagine the thing that creates the pgmap would specify the > policy it wants. In most cases the policy is tightly coupled to what > the free function in the the provided dev_pagemap_ops does.. The thing that creates the pgmap is the device-driver, and device-driver does not implement truncate or reclaim. It's not until the FS mounts that the pgmap needs to start enforcing pin lifetime guarantees. > > > Which is in line with the 'dax_holder_ops' concept being introduced > > for reverse mapping support. I.e. when the FS claims the dax-device > > it can specify at that point that it wants to forbid longterm. > > Which is a reasonable refinment if we think there are cases where two > nvdim users would want different things. > It's already the case that device-dax does not enforce transient pin lifetimes. > Anyhow, I'm wondering on a way forward. There are many balls in the > air, all linked: > - Joao's compound page support for device_dax and more > - Alex's DEVICE_COHERENT I have not seen these patches. /me notices no MAINTAINERS mention for include/linux/memremap.h > - The refcount normalization > - Removing the pgmap test from GUP > - Removing the need for the PUD/PMD/PTE special bit > - Removing the need for the PUD/PMD/PTE devmap bit It's not clear that this anything but pure cleanup once the special bit can be used for architectures that don't have devmap. Those same archs presumably don't care about the THP collisions with DAX. > - Remove PUD/PMD vma_is_special > - folios for fsdax > - shootdown for fsdax > > Frankly I'm leery to see more ZONE_DEVICE users crop up that depend on > the current semantics as that will only make it even harder to fix.. > > I think it would be good to see Joao's compound page support move > ahead.. > > So.. Does anyone want to work on finishing this patch series?? I can > give some guidance on how I think it should work at least Completing the DAX reflink work is in my near term goals and that includes "shootdown for fsdax and removing the pgmap test from GUP", but probably not in the order that "refcount normalization" folks would prefer. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-18 19:37 ` Dan Williams @ 2021-10-18 23:06 ` Jason Gunthorpe 2021-10-19 15:13 ` Joao Martins 0 siblings, 1 reply; 17+ messages in thread From: Jason Gunthorpe @ 2021-10-18 23:06 UTC (permalink / raw) To: Dan Williams Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand, Joao Martins On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: > > device-dax uses PUD, along with TTM, they are the only places. I'm not > > sure TTM is a real place though. > > I was setting device-dax aside because it can use Joao's changes to > get compound-page support. Ideally, but that ideas in that patch series have been floating around for a long time now.. > > As I understand things, something like FSDAX post-folio should > > generate maximal compound pages for extents in the page cache that are > > physically contiguous. > > > > A high order folio can be placed in any lower order in the page > > tables, so we never have to fracture it, unless the underlying page > > are moved around - which requires an unmap_mapping_range() cycle.. > > That would be useful to disconnect the compound-page size from the > page-table-entry installed for the page. However, don't we need > typical compound page fracturing in the near term until folios move > ahead? I do not know, just mindful not to get ahead of Matthew > > > There are end users that would notice the PMD regression, and I think > > > FSDAX PMDs with proper compound page metadata is on the same order of > > > work as fixing the refcount. > > > > Hmm, I don't know.. I sketched out the refcount stuff and the code is > > OK but ugly and will add a conditional to some THP cases > > That reminds me that there are several places that do: > > pmd_devmap(pmd) || pmd_trans_huge(pmd) I haven't tried to look at this yet. I did check that the pte_devmap() flag can be deleted, but this is more tricky. We have pmd_huge(), pmd_large(), pmd_devmap(), pmd_trans_huge(), pmd_leaf(), at least and I couldn't tell you today the subtle differences between all of these things on every arch :\ AFAIK there should only be three case: - pmd points to a pte table - pmd is in the special hugetlb format - pmd points at something described by struct page(s) > ...for the common cases where a THP and DEVMAP page are equivalent, > but there are a few places where those paths are not shared when the > THP path expects that the page came from the page allocator. So while > DEVMAP is not needed in GUP after this conversion, there still needs > to be an audit of when THP needs to be careful of DAX mappings. Yes, it is a tricky job to do the full work, but I think in the end, 'pmd points at something described by struct page(s)' is enough for all code to use is_zone_device_page() instead of a PTE bit or VMA flag to drive its logic. > > Here I imagine the thing that creates the pgmap would specify the > > policy it wants. In most cases the policy is tightly coupled to what > > the free function in the the provided dev_pagemap_ops does.. > > The thing that creates the pgmap is the device-driver, and > device-driver does not implement truncate or reclaim. It's not until > the FS mounts that the pgmap needs to start enforcing pin lifetime > guarantees. I am explaining this wrong, the immediate need is really 'should foll_longterm fail fast-gup to the slow path' and something like the nvdimm driver can just set that to 1 and rely on VMA flags to control what the slow path does - as is today. It is not as elegant as more flags in the pgmap, but it would get the job done with minimal fuss. Might be nice to either rely fully on VMA flags or fully on pgmap holder flags for FOLL_LONGTERM? > > Anyhow, I'm wondering on a way forward. There are many balls in the > > air, all linked: > > - Joao's compound page support for device_dax and more > > - Alex's DEVICE_COHERENT > > I have not seen these patches. It is where this series came from. As DEVICE_COHERENT is focused on changing the migration code and, as I recall, the 1 == free thing complicated that enough that Christoph requested it be cleaned. > > - The refcount normalization > > - Removing the pgmap test from GUP > > - Removing the need for the PUD/PMD/PTE special bit > > - Removing the need for the PUD/PMD/PTE devmap bit > > It's not clear that this anything but pure cleanup once the special > bit can be used for architectures that don't have devmap. Those same > archs presumably don't care about the THP collisions with DAX. I understood there was some community that was interested in DAX on other arches that don't have the PTE bits to spare, so this would be of interest to them? > Completing the DAX reflink work is in my near term goals and that > includes "shootdown for fsdax and removing the pgmap test from GUP", > but probably not in the order that "refcount normalization" folks > would prefer. Indeed, I don't think that will help many of the stuck items on the list move ahead. Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-18 23:06 ` Jason Gunthorpe @ 2021-10-19 15:13 ` Joao Martins 2021-10-19 16:01 ` Jason Gunthorpe 0 siblings, 1 reply; 17+ messages in thread From: Joao Martins @ 2021-10-19 15:13 UTC (permalink / raw) To: Jason Gunthorpe, Dan Williams Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand On 10/19/21 00:06, Jason Gunthorpe wrote: > On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: > >>> device-dax uses PUD, along with TTM, they are the only places. I'm not >>> sure TTM is a real place though. >> >> I was setting device-dax aside because it can use Joao's changes to >> get compound-page support. > > Ideally, but that ideas in that patch series have been floating around > for a long time now.. > The current status of the series misses a Rb on patches 6,7,10,12-14. Well, patch 8 too should now drop its tag, considering the latest discussion. If it helps moving things forward I could split my series further into: 1) the compound page introduction (patches 1-7) of my aforementioned series 2) vmemmap deduplication for memory gains (patches 9-14) 3) gup improvements (patch 8 and gup-slow improvements) The reason being that item 1) is the the main dependency listed below. And allows 2) and 3) to be parallelized. FWIW, it is almost fully reviewed by Dan (as of v3->v4). For (1) patches 6 & 7 are on changes to device-dax subsystem (drivers/dax/*) which still needs his Ack. >>> Here I imagine the thing that creates the pgmap would specify the >>> policy it wants. In most cases the policy is tightly coupled to what >>> the free function in the the provided dev_pagemap_ops does.. >> >> The thing that creates the pgmap is the device-driver, and >> device-driver does not implement truncate or reclaim. It's not until >> the FS mounts that the pgmap needs to start enforcing pin lifetime >> guarantees. > > I am explaining this wrong, the immediate need is really 'should > foll_longterm fail fast-gup to the slow path' and something like the > nvdimm driver can just set that to 1 and rely on VMA flags to control > what the slow path does - as is today. > > It is not as elegant as more flags in the pgmap, but it would get the > job done with minimal fuss. > > Might be nice to either rely fully on VMA flags or fully on pgmap > holder flags for FOLL_LONGTERM? > Whats the benefit between preventing longterm at start versus only after mounting the filesystem? Or is the intended future purpose to pass more context into an holder potential future callback e.g. nack longterm pins on a page basis? Maybe we can start by at least not add any flags and just prevent FOLL_LONGTERM on fsdax -- which I guess was the original purpose of commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast"). This patch (which I can formally send) has a sketch of that (below scissors mark): https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com/ It uses pgmap->type rather than adding further fields into pgmap, given this restriction applies only to fsdax. ... and then we could improve devmap_longterm_available(pgmap) to look at the holder::flags or pgmap::flags should we decide that an explicit flags is required from holder/pgmap .. as a further improvement? Joao ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-19 15:13 ` Joao Martins @ 2021-10-19 16:01 ` Jason Gunthorpe 2021-10-19 19:21 ` Dan Williams 0 siblings, 1 reply; 17+ messages in thread From: Jason Gunthorpe @ 2021-10-19 16:01 UTC (permalink / raw) To: Joao Martins Cc: Dan Williams, Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: > On 10/19/21 00:06, Jason Gunthorpe wrote: > > On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: > > > >>> device-dax uses PUD, along with TTM, they are the only places. I'm not > >>> sure TTM is a real place though. > >> > >> I was setting device-dax aside because it can use Joao's changes to > >> get compound-page support. > > > > Ideally, but that ideas in that patch series have been floating around > > for a long time now.. > > > The current status of the series misses a Rb on patches 6,7,10,12-14. > Well, patch 8 too should now drop its tag, considering the latest > discussion. > > If it helps moving things forward I could split my series further into: > > 1) the compound page introduction (patches 1-7) of my aforementioned series > 2) vmemmap deduplication for memory gains (patches 9-14) > 3) gup improvements (patch 8 and gup-slow improvements) I would split it, yes.. I think we can see a general consensus that making compound_head/etc work consistently with how THP uses it will provide value and opportunity for optimization going forward. > Whats the benefit between preventing longterm at start > versus only after mounting the filesystem? Or is the intended future purpose > to pass more context into an holder potential future callback e.g. nack longterm > pins on a page basis? I understood Dan's remark that the device-dax path allows FOLL_LONGTERM and the FSDAX path does not ? Which, IIRC, today is signaled basd on vma properties and in all cases fast-gup is denied. > Maybe we can start by at least not add any flags and just prevent > FOLL_LONGTERM on fsdax -- which I guess was the original purpose of > commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast"). > This patch (which I can formally send) has a sketch of that (below scissors mark): > > https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com/ Yes, basically, whatever test we want for 'deny fast gup foll longterm' is fine. Personally I'd like to see us move toward a set of flag specifying each special behavior and not a collection of types that imply special behaviors. Eg we have at least: - Block gup fast on foll_longterm - Capture the refcount ==1 and use the pgmap free hook (confusingly called page_is_devmap_managed()) - Always use a swap entry - page->index/mapping are used in the usual file based way? Probably more things.. Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-19 16:01 ` Jason Gunthorpe @ 2021-10-19 19:21 ` Dan Williams 2021-10-20 17:06 ` Joao Martins 0 siblings, 1 reply; 17+ messages in thread From: Dan Williams @ 2021-10-19 19:21 UTC (permalink / raw) To: Jason Gunthorpe Cc: Joao Martins, Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: > > On 10/19/21 00:06, Jason Gunthorpe wrote: > > > On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: > > > > > >>> device-dax uses PUD, along with TTM, they are the only places. I'm not > > >>> sure TTM is a real place though. > > >> > > >> I was setting device-dax aside because it can use Joao's changes to > > >> get compound-page support. > > > > > > Ideally, but that ideas in that patch series have been floating around > > > for a long time now.. > > > > > The current status of the series misses a Rb on patches 6,7,10,12-14. > > Well, patch 8 too should now drop its tag, considering the latest > > discussion. > > > > If it helps moving things forward I could split my series further into: > > > > 1) the compound page introduction (patches 1-7) of my aforementioned series > > 2) vmemmap deduplication for memory gains (patches 9-14) > > 3) gup improvements (patch 8 and gup-slow improvements) > > I would split it, yes.. > > I think we can see a general consensus that making compound_head/etc > work consistently with how THP uses it will provide value and > opportunity for optimization going forward. > > > Whats the benefit between preventing longterm at start > > versus only after mounting the filesystem? Or is the intended future purpose > > to pass more context into an holder potential future callback e.g. nack longterm > > pins on a page basis? > > I understood Dan's remark that the device-dax path allows > FOLL_LONGTERM and the FSDAX path does not ? > > Which, IIRC, today is signaled basd on vma properties and in all cases > fast-gup is denied. Yeah, I forgot that 7af75561e171 eliminated any possibility of longterm-gup-fast for device-dax, let's not disturb that status quo. > > Maybe we can start by at least not add any flags and just prevent > > FOLL_LONGTERM on fsdax -- which I guess was the original purpose of > > commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast"). > > This patch (which I can formally send) has a sketch of that (below scissors mark): > > > > https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com/ > > Yes, basically, whatever test we want for 'deny fast gup foll > longterm' is fine. > > Personally I'd like to see us move toward a set of flag specifying > each special behavior and not a collection of types that imply special > behaviors. > > Eg we have at least: > - Block gup fast on foll_longterm > - Capture the refcount ==1 and use the pgmap free hook > (confusingly called page_is_devmap_managed()) > - Always use a swap entry > - page->index/mapping are used in the usual file based way? > > Probably more things.. Yes, agree with the principle of reducing type-implied special casing. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-19 19:21 ` Dan Williams @ 2021-10-20 17:06 ` Joao Martins 2021-10-20 17:12 ` Dan Williams 0 siblings, 1 reply; 17+ messages in thread From: Joao Martins @ 2021-10-20 17:06 UTC (permalink / raw) To: Dan Williams, Jason Gunthorpe Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand On 10/19/21 20:21, Dan Williams wrote: > On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >> >> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: >>> On 10/19/21 00:06, Jason Gunthorpe wrote: >>>> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: >>>> >>>>>> device-dax uses PUD, along with TTM, they are the only places. I'm not >>>>>> sure TTM is a real place though. >>>>> >>>>> I was setting device-dax aside because it can use Joao's changes to >>>>> get compound-page support. >>>> >>>> Ideally, but that ideas in that patch series have been floating around >>>> for a long time now.. >>>> >>> The current status of the series misses a Rb on patches 6,7,10,12-14. >>> Well, patch 8 too should now drop its tag, considering the latest >>> discussion. >>> >>> If it helps moving things forward I could split my series further into: >>> >>> 1) the compound page introduction (patches 1-7) of my aforementioned series >>> 2) vmemmap deduplication for memory gains (patches 9-14) >>> 3) gup improvements (patch 8 and gup-slow improvements) >> >> I would split it, yes.. >> >> I think we can see a general consensus that making compound_head/etc >> work consistently with how THP uses it will provide value and >> opportunity for optimization going forward. >> I'll go do that. Meanwhile, I'll wait a couple days for Dan to review the dax subsystem patches (6 & 7), or otherwise send them over. >>> Whats the benefit between preventing longterm at start >>> versus only after mounting the filesystem? Or is the intended future purpose >>> to pass more context into an holder potential future callback e.g. nack longterm >>> pins on a page basis? >> >> I understood Dan's remark that the device-dax path allows >> FOLL_LONGTERM and the FSDAX path does not ? >> >> Which, IIRC, today is signaled basd on vma properties and in all cases >> fast-gup is denied. > > Yeah, I forgot that 7af75561e171 eliminated any possibility of > longterm-gup-fast for device-dax, let's not disturb that status quo. > I am slightly confused by this comment -- the status quo is what we are questioning here -- And we talked about changing that in the past too (thread below), that longterm-gup-fast was an oversight that that commit was only applicable to fsdax. [Maybe this is just my english confusion] >>> Maybe we can start by at least not add any flags and just prevent >>> FOLL_LONGTERM on fsdax -- which I guess was the original purpose of >>> commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast"). >>> This patch (which I can formally send) has a sketch of that (below scissors mark): >>> >>> https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com/ >> >> Yes, basically, whatever test we want for 'deny fast gup foll >> longterm' is fine. >> >> Personally I'd like to see us move toward a set of flag specifying >> each special behavior and not a collection of types that imply special >> behaviors. >> >> Eg we have at least: >> - Block gup fast on foll_longterm >> - Capture the refcount ==1 and use the pgmap free hook >> (confusingly called page_is_devmap_managed()) >> - Always use a swap entry >> - page->index/mapping are used in the usual file based way? >> >> Probably more things.. > > Yes, agree with the principle of reducing type-implied special casing. > OK. Moving from implicit devmap types to pgmap::flags is rather simple fixup. And I suppose (respectivally) PGMAP_NO_PINF_LONGTERM, PGMAP_MANAGED_FREE_PAGE, PGMAP_USE_SWAP_ENTRY, etc, etc. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-20 17:06 ` Joao Martins @ 2021-10-20 17:12 ` Dan Williams 2021-10-20 18:51 ` Joao Martins 0 siblings, 1 reply; 17+ messages in thread From: Dan Williams @ 2021-10-20 17:12 UTC (permalink / raw) To: Joao Martins Cc: Jason Gunthorpe, Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand On Wed, Oct 20, 2021 at 10:09 AM Joao Martins <joao.m.martins@oracle.com> wrote: > > On 10/19/21 20:21, Dan Williams wrote: > > On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > >> > >> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: > >>> On 10/19/21 00:06, Jason Gunthorpe wrote: > >>>> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: > >>>> > >>>>>> device-dax uses PUD, along with TTM, they are the only places. I'm not > >>>>>> sure TTM is a real place though. > >>>>> > >>>>> I was setting device-dax aside because it can use Joao's changes to > >>>>> get compound-page support. > >>>> > >>>> Ideally, but that ideas in that patch series have been floating around > >>>> for a long time now.. > >>>> > >>> The current status of the series misses a Rb on patches 6,7,10,12-14. > >>> Well, patch 8 too should now drop its tag, considering the latest > >>> discussion. > >>> > >>> If it helps moving things forward I could split my series further into: > >>> > >>> 1) the compound page introduction (patches 1-7) of my aforementioned series > >>> 2) vmemmap deduplication for memory gains (patches 9-14) > >>> 3) gup improvements (patch 8 and gup-slow improvements) > >> > >> I would split it, yes.. > >> > >> I think we can see a general consensus that making compound_head/etc > >> work consistently with how THP uses it will provide value and > >> opportunity for optimization going forward. > >> > > I'll go do that. Meanwhile, I'll wait a couple days for Dan to review the > dax subsystem patches (6 & 7), or otherwise send them over. > > >>> Whats the benefit between preventing longterm at start > >>> versus only after mounting the filesystem? Or is the intended future purpose > >>> to pass more context into an holder potential future callback e.g. nack longterm > >>> pins on a page basis? > >> > >> I understood Dan's remark that the device-dax path allows > >> FOLL_LONGTERM and the FSDAX path does not ? > >> > >> Which, IIRC, today is signaled basd on vma properties and in all cases > >> fast-gup is denied. > > > > Yeah, I forgot that 7af75561e171 eliminated any possibility of > > longterm-gup-fast for device-dax, let's not disturb that status quo. > > > I am slightly confused by this comment -- the status quo is what we are > questioning here -- And we talked about changing that in the past too > (thread below), that longterm-gup-fast was an oversight that that commit > was only applicable to fsdax. [Maybe this is just my english confusion] No, you have it correct. However that "regression" has gone unnoticed, so unless there is data showing that gup-fast on device-dax is critical for longterm page pinning workflows I'm ok for longterm to continue to trigger gup-slow. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount 2021-10-20 17:12 ` Dan Williams @ 2021-10-20 18:51 ` Joao Martins 0 siblings, 0 replies; 17+ messages in thread From: Joao Martins @ 2021-10-20 18:51 UTC (permalink / raw) To: Dan Williams Cc: Jason Gunthorpe, Matthew Wilcox, Alex Sierra, Andrew Morton, Kuehling, Felix, Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list, Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse, Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM, David Hildenbrand On 10/20/21 18:12, Dan Williams wrote: > On Wed, Oct 20, 2021 at 10:09 AM Joao Martins <joao.m.martins@oracle.com> wrote: >> On 10/19/21 20:21, Dan Williams wrote: >>> On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >>>> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: >>>>> On 10/19/21 00:06, Jason Gunthorpe wrote: >>>>>> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: >>>>> Whats the benefit between preventing longterm at start >>>>> versus only after mounting the filesystem? Or is the intended future purpose >>>>> to pass more context into an holder potential future callback e.g. nack longterm >>>>> pins on a page basis? >>>> >>>> I understood Dan's remark that the device-dax path allows >>>> FOLL_LONGTERM and the FSDAX path does not ? >>>> >>>> Which, IIRC, today is signaled basd on vma properties and in all cases >>>> fast-gup is denied. >>> >>> Yeah, I forgot that 7af75561e171 eliminated any possibility of >>> longterm-gup-fast for device-dax, let's not disturb that status quo. >>> >> I am slightly confused by this comment -- the status quo is what we are >> questioning here -- And we talked about changing that in the past too >> (thread below), that longterm-gup-fast was an oversight that that commit >> was only applicable to fsdax. [Maybe this is just my english confusion] > > No, you have it correct. However that "regression" has gone unnoticed, > so unless there is data showing that gup-fast on device-dax is > critical for longterm page pinning workflows I'm ok for longterm to > continue to trigger gup-slow. > To be fair, it's not surprising that nobody noticed -- my general intent was just to special-case less for device-dax. Not many places use pin_user_pages_fast(FOLL_LONGTERM). This is only exposed on those few cases that do try to use it (e.g. RDMA/IB, vDPA), and specifically when the page fault occurs (that requires fallback-ing to gup-slow) at a higher level than the amount you're pinning e.g. pinning in 2M extents on a device-dax of 1G pagesize. Pin size is limited to a 2M extent at a time by the users I mentioned above -- regardless of the total size of the extent you will be pinning (i.e. 512 struct pages pointers fit one page). But even with all this, this [FOLL_LONGTERM on pin-fast] would still go unnoticed because gup-fast on devmap is just as slow as gup :) ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2021-10-20 18:52 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20211014153928.16805-1-alex.sierra@amd.com> [not found] ` <20211014153928.16805-3-alex.sierra@amd.com> [not found] ` <20211014170634.GV2744544@nvidia.com> 2021-10-14 18:43 ` [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount Matthew Wilcox 2021-10-14 19:01 ` Dan Williams 2021-10-14 23:06 ` Jason Gunthorpe 2021-10-15 1:37 ` Dan Williams 2021-10-16 15:44 ` Jason Gunthorpe 2021-10-16 16:39 ` Matthew Wilcox 2021-10-17 18:20 ` Dan Williams 2021-10-17 18:35 ` Dan Williams 2021-10-18 18:25 ` Jason Gunthorpe 2021-10-18 19:37 ` Dan Williams 2021-10-18 23:06 ` Jason Gunthorpe 2021-10-19 15:13 ` Joao Martins 2021-10-19 16:01 ` Jason Gunthorpe 2021-10-19 19:21 ` Dan Williams 2021-10-20 17:06 ` Joao Martins 2021-10-20 17:12 ` Dan Williams 2021-10-20 18:51 ` Joao Martins
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).