linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: "Matthew Wilcox" <willy@infradead.org>,
	"Alex Sierra" <alex.sierra@amd.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Linux MM" <linux-mm@kvack.org>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
	"Maling list - DRI developers" <dri-devel@lists.freedesktop.org>,
	"Christoph Hellwig" <hch@lst.de>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Vishal Verma" <vishal.l.verma@intel.com>,
	"Dave Jiang" <dave.jiang@intel.com>,
	"Linux NVDIMM" <nvdimm@lists.linux.dev>,
	"David Hildenbrand" <david@redhat.com>
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
Date: Sun, 17 Oct 2021 11:35:35 -0700	[thread overview]
Message-ID: <CAPcyv4j0kHREAOG6_07E2foz6e4FP8D72mZXH6ivsiUBu_8c6g@mail.gmail.com> (raw)
In-Reply-To: <20211016154450.GJ2744544@nvidia.com>

On Sat, Oct 16, 2021 at 8:45 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 14, 2021 at 06:37:35PM -0700, Dan Williams wrote:
> > On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote:
> > > > > > Does anyone know why devmap is pte_special anyhow?
> > > >
> > > > It does not need to be special as mentioned here:
> > > >
> > > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/
> > >
> > > I added a remark there
> > >
> > > Not special means more to me, it means devmap should do the refcounts
> > > properly like normal memory pages.
> > >
> > > It means vm_normal_page should return !NULL and it means insert_page,
> > > not insert_pfn should be used to install them in the PTE. VMAs should
> > > not be MIXED MAP, but normal struct page maps.
> > >
> > > I think this change alone would fix all the refcount problems
> > > everwhere in DAX and devmap.
> > >
> > > > The refcount dependencies also go away after this...
> > > >
> > > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > >
> > > > ...but you can see that patches 1 and 2 in that series depend on being
> > > > able to guarantee that all mappings are invalidated when the undelying
> > > > device that owns the pgmap goes away.
> > >
> > > If I have put everything together right this is because of what I
> > > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and
> > > expecting that to work sanely.
> > >
> > > This means the page map cannot be removed until all the PTEs are fully
> > > flushed, which buggily doesn't happen because of the missing unplug.
> > >
> > > However, this is all because nobody incrd a refcount to represent the
> > > reference in the PTE and since this ment that 0 refcount pages were
> > > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to
> > > unbreak GUP?
> > >
> > > So.. Is there some reason why devmap pages are trying so hard to avoid
> > > sane refcounting???
> >
> > I wouldn't put it that way. It's more that the original sin of
> > ZONE_DEVICE that sought to reuse the lru field space, by never having
> > a zero recount, then got layered upon and calcified in malignant ways.
> > In the meantime surrounding infrastructure got decrustified. Work like
> > the 'struct page' cleanup among other things, made it clearer and
> > clearer over time that the original design choice needed to be fixed.
>
> So, there used to be some reason, but with the current code
> arrangement it is not the case? This is why it looks so strange when
> reading it..
>
> AFIACT we are good on the LRU stuff now? Eg release_pages() does not
> use page->lru for is_zone_device_page()?

Yes, the lru collision was fixed by the 'struct page' cleanup.

>
> > The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but
> > now that we have page-available DAX, yes, we can skip the FS
> > notification and just rely on typical refcounting and hanging until
> > the FS has a chance to uninstall the PTEs. You're right, the FS
> > notification is an improvement to the conversion, not a requirement.
>
> It all sounds so simple. I looked at this for a good long time and the
> first major roadblock is huge pages.
>
> The mm side is designed around THP which puts a single high order page
> into the PUD/PMD such that the mm only has to juggle one page. This a
> very sane and reasonable thing to do.
>
> DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> THP would make using normal refconting much simpler. I looked at
> teaching the mm core to deal with page arrays - it is certainly
> doable, but it is quite inefficient and ugly mm code.

THP does not support PUD, and neither does FSDAX, so it's only PMDs we
need to worry about.

>
> So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
>
> Joao has a series that does this to device-dax:
>
> https://lore.kernel.org/all/20210827145819.16471-1-joao.m.martins@oracle.com/

That assumes there's never any need to fracture a huge page which
FSDAX could not support unless the filesystem was built with 2MB block
size.

> TTM is kind of broken already but does have a struct page, and it is
> probably already a high order one. Maybe it is OK? I will ask Thomas
>
> That leaves FSDAX. Can this be fixed? I know nothing of filesystems or
> fsdax to guess. Sounds like folios to me ..

My thought here is to assemble a compound page on the fly when
establishing the FSDAX PMD mapping.

> Assuming changing FSDAX is hard.. How would DAX people feel about just
> deleting the PUD/PMD support until it can be done with compound pages?

There are end users that would notice the PMD regression, and I think
FSDAX PMDs with proper compound page metadata is on the same order of
work as fixing the refcount.

> Doing so would allow fixing the lifecycle, cleaning up gup and
> basically delete a huge wack of slow DAX and devmap specific code from
> the mm. It also opens the door to removing the PTE flag and thus
> allowing DAX on more architectures.
>
> > However, there still needs to be something in the gup-fast path to
> > indicate that GUP_LONGTERM is not possible because the PTE
> > represents
>
> It looks easy now:
>
> 1) We know the pfn has a struct page * because it isn't pfn special
>
> 2) We can get a valid ref on the struct page *:
>
>       head = try_grab_compound_head(page, 1, flags);
>
>    Holding a ref ensures that head->pgmap is valid.
>
> 3) Then check the page type directly:
>
>      if ((flags & FOLL_LONGTERM) && is_zone_device_page(head))
>
>    This tells us we can access the ZONE_DEVICE struct in the union
>
> 4) Ask what the pgmap owner wants to do:
>
>     if (head->pgmap->deny_foll_longterm)
>           return FAIL

The pgmap itself does not know, but the "holder" could specify this
policy. Which is in line with the 'dax_holder_ops' concept being
introduced for reverse mapping support. I.e. when the FS claims the
dax-device it can specify at that point that it wants to forbid
longterm.

> Cost is only paied if FOLL_LONGTERM is given

Yeah, that does naturally fall out from no longer needing to take an
explicit dev_pagemap reference and assuming a page is always there.


  parent reply	other threads:[~2021-10-17 18:35 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-14 15:39 [PATCH v1 0/2] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
2021-10-14 15:39 ` [PATCH v1 1/2] ext4/xfs: add page refcount helper Alex Sierra
2021-10-14 16:25   ` Jason Gunthorpe
2021-10-14 16:40   ` Matthew Wilcox
2021-10-14 15:39 ` [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
2021-10-14 16:52   ` Matthew Wilcox
2021-10-14 17:06   ` Jason Gunthorpe
2021-10-14 17:35     ` Ralph Campbell
2021-10-14 18:01       ` Jason Gunthorpe
2021-10-14 20:57         ` Ralph Campbell
2021-10-15  3:45           ` Sierra Guiza, Alejandro (Alex)
2021-10-15 11:06             ` Jason Gunthorpe
2021-10-14 18:43     ` Matthew Wilcox
2021-10-14 19:01       ` Dan Williams
2021-10-14 23:06         ` Jason Gunthorpe
2021-10-15  1:37           ` Dan Williams
2021-10-16 15:44             ` Jason Gunthorpe
2021-10-16 16:39               ` Matthew Wilcox
2021-10-17 18:20                 ` Dan Williams
2021-10-17 18:35               ` Dan Williams [this message]
2021-10-18 18:25                 ` Jason Gunthorpe
2021-10-18 19:37                   ` Dan Williams
2021-10-18 23:06                     ` Jason Gunthorpe
2021-10-19 15:13                       ` Joao Martins
2021-10-19 16:01                         ` Jason Gunthorpe
2021-10-19 19:21                           ` Dan Williams
2021-10-20 17:06                             ` Joao Martins
2021-10-20 17:12                               ` Dan Williams
2021-10-20 18:51                                 ` Joao Martins
2021-11-15 19:33 [PATCH v1 0/2] Remove extra ZONE_DEVICE " Alex Sierra
2021-11-15 19:33 ` [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct " Alex Sierra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4j0kHREAOG6_07E2foz6e4FP8D72mZXH6ivsiUBu_8c6g@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=apopple@nvidia.com \
    --cc=dave.jiang@intel.com \
    --cc=david@redhat.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@lst.de \
    --cc=jgg@nvidia.com \
    --cc=jglisse@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=nvdimm@lists.linux.dev \
    --cc=rcampbell@nvidia.com \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).