From: Dan Williams <firstname.lastname@example.org>
To: Jason Gunthorpe <email@example.com>
Cc: "Matthew Wilcox" <firstname.lastname@example.org>,
"Alex Sierra" <email@example.com>,
"Andrew Morton" <firstname.lastname@example.org>,
"Kuehling, Felix" <Felix.Kuehling@amd.com>,
"Linux MM" <email@example.com>,
"Ralph Campbell" <firstname.lastname@example.org>,
"amd-gfx list" <email@example.com>,
"Maling list - DRI developers" <firstname.lastname@example.org>,
"Christoph Hellwig" <email@example.com>,
"Jérôme Glisse" <firstname.lastname@example.org>,
"Alistair Popple" <email@example.com>,
"Vishal Verma" <firstname.lastname@example.org>,
"Dave Jiang" <email@example.com>,
"Linux NVDIMM" <firstname.lastname@example.org>,
"David Hildenbrand" <email@example.com>,
"Joao Martins" <firstname.lastname@example.org>
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
Date: Mon, 18 Oct 2021 12:37:30 -0700 [thread overview]
Message-ID: <CAPcyv4jvZjeMcKLVuOEQ_gXRd87i3NUX5D=MmsJ++rWafnK-NQ@mail.gmail.com> (raw)
On Mon, Oct 18, 2021 at 11:26 AM Jason Gunthorpe <email@example.com> wrote:
> On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams wrote:
> > > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> > > THP would make using normal refconting much simpler. I looked at
> > > teaching the mm core to deal with page arrays - it is certainly
> > > doable, but it is quite inefficient and ugly mm code.
> > THP does not support PUD, and neither does FSDAX, so it's only PMDs we
> > need to worry about.
> device-dax uses PUD, along with TTM, they are the only places. I'm not
> sure TTM is a real place though.
I was setting device-dax aside because it can use Joao's changes to
get compound-page support.
> > > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
> > >
> > > Joao has a series that does this to device-dax:
> > >
> > > https://firstname.lastname@example.org/
> > That assumes there's never any need to fracture a huge page which
> > FSDAX could not support unless the filesystem was built with 2MB block
> > size.
> As I understand things, something like FSDAX post-folio should
> generate maximal compound pages for extents in the page cache that are
> physically contiguous.
> A high order folio can be placed in any lower order in the page
> tables, so we never have to fracture it, unless the underlying page
> are moved around - which requires an unmap_mapping_range() cycle..
That would be useful to disconnect the compound-page size from the
page-table-entry installed for the page. However, don't we need
typical compound page fracturing in the near term until folios move
> > > Assuming changing FSDAX is hard.. How would DAX people feel about just
> > > deleting the PUD/PMD support until it can be done with compound pages?
> > There are end users that would notice the PMD regression, and I think
> > FSDAX PMDs with proper compound page metadata is on the same order of
> > work as fixing the refcount.
> Hmm, I don't know.. I sketched out the refcount stuff and the code is
> OK but ugly and will add a conditional to some THP cases
That reminds me that there are several places that do:
pmd_devmap(pmd) || pmd_trans_huge(pmd)
...for the common cases where a THP and DEVMAP page are equivalent,
but there are a few places where those paths are not shared when the
THP path expects that the page came from the page allocator. So while
DEVMAP is not needed in GUP after this conversion, there still needs
to be an audit of when THP needs to be careful of DAX mappings.
> On the other hand, making THP unmap cases a bit slower is probably a
> net win compared to making put_page a bit slower.. Considering unmap
> is already quite heavy.
FSDAX eventually learned how to replace 'struct page' with xarray for
several paths, so "how hard could it be" (/famous last words) to
replace xarray with 'struct page'? I think the end result will be
cleaner, but yes, I expect some dragons in the conversion.
> > > 4) Ask what the pgmap owner wants to do:
> > >
> > > if (head->pgmap->deny_foll_longterm)
> > > return FAIL
> > The pgmap itself does not know, but the "holder" could specify this
> > policy.
> Here I imagine the thing that creates the pgmap would specify the
> policy it wants. In most cases the policy is tightly coupled to what
> the free function in the the provided dev_pagemap_ops does..
The thing that creates the pgmap is the device-driver, and
device-driver does not implement truncate or reclaim. It's not until
the FS mounts that the pgmap needs to start enforcing pin lifetime
> > Which is in line with the 'dax_holder_ops' concept being introduced
> > for reverse mapping support. I.e. when the FS claims the dax-device
> > it can specify at that point that it wants to forbid longterm.
> Which is a reasonable refinment if we think there are cases where two
> nvdim users would want different things.
It's already the case that device-dax does not enforce transient pin lifetimes.
> Anyhow, I'm wondering on a way forward. There are many balls in the
> air, all linked:
> - Joao's compound page support for device_dax and more
> - Alex's DEVICE_COHERENT
I have not seen these patches.
/me notices no MAINTAINERS mention for include/linux/memremap.h
> - The refcount normalization
> - Removing the pgmap test from GUP
> - Removing the need for the PUD/PMD/PTE special bit
> - Removing the need for the PUD/PMD/PTE devmap bit
It's not clear that this anything but pure cleanup once the special
bit can be used for architectures that don't have devmap. Those same
archs presumably don't care about the THP collisions with DAX.
> - Remove PUD/PMD vma_is_special
> - folios for fsdax
> - shootdown for fsdax
> Frankly I'm leery to see more ZONE_DEVICE users crop up that depend on
> the current semantics as that will only make it even harder to fix..
> I think it would be good to see Joao's compound page support move
> So.. Does anyone want to work on finishing this patch series?? I can
> give some guidance on how I think it should work at least
Completing the DAX reflink work is in my near term goals and that
includes "shootdown for fsdax and removing the pgmap test from GUP",
but probably not in the order that "refcount normalization" folks
next prev parent reply other threads:[~2021-10-18 19:37 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-14 15:39 [PATCH v1 0/2] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
2021-10-14 15:39 ` [PATCH v1 1/2] ext4/xfs: add page refcount helper Alex Sierra
2021-10-14 16:25 ` Jason Gunthorpe
2021-10-14 16:40 ` Matthew Wilcox
2021-10-14 15:39 ` [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
2021-10-14 16:52 ` Matthew Wilcox
2021-10-14 17:06 ` Jason Gunthorpe
2021-10-14 17:35 ` Ralph Campbell
2021-10-14 18:01 ` Jason Gunthorpe
2021-10-14 20:57 ` Ralph Campbell
2021-10-15 3:45 ` Sierra Guiza, Alejandro (Alex)
2021-10-15 11:06 ` Jason Gunthorpe
2021-10-14 18:43 ` Matthew Wilcox
2021-10-14 19:01 ` Dan Williams
2021-10-14 23:06 ` Jason Gunthorpe
2021-10-15 1:37 ` Dan Williams
2021-10-16 15:44 ` Jason Gunthorpe
2021-10-16 16:39 ` Matthew Wilcox
2021-10-17 18:20 ` Dan Williams
2021-10-17 18:35 ` Dan Williams
2021-10-18 18:25 ` Jason Gunthorpe
2021-10-18 19:37 ` Dan Williams [this message]
2021-10-18 23:06 ` Jason Gunthorpe
2021-10-19 15:13 ` Joao Martins
2021-10-19 16:01 ` Jason Gunthorpe
2021-10-19 19:21 ` Dan Williams
2021-10-20 17:06 ` Joao Martins
2021-10-20 17:12 ` Dan Williams
2021-10-20 18:51 ` Joao Martins
2021-11-15 19:33 [PATCH v1 0/2] Remove extra ZONE_DEVICE " Alex Sierra
2021-11-15 19:33 ` [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct " Alex Sierra
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).