nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: "Matthew Wilcox" <willy@infradead.org>,
	"Alex Sierra" <alex.sierra@amd.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Linux MM" <linux-mm@kvack.org>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
	"Maling list - DRI developers" <dri-devel@lists.freedesktop.org>,
	"Christoph Hellwig" <hch@lst.de>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Vishal Verma" <vishal.l.verma@intel.com>,
	"Dave Jiang" <dave.jiang@intel.com>,
	"Linux NVDIMM" <nvdimm@lists.linux.dev>
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
Date: Sat, 16 Oct 2021 12:44:50 -0300	[thread overview]
Message-ID: <20211016154450.GJ2744544@nvidia.com> (raw)
In-Reply-To: <CAPcyv4hC4qxbO46hp=XBpDaVbeh=qdY6TgvacXRprQ55Qwe-Dg@mail.gmail.com>

On Thu, Oct 14, 2021 at 06:37:35PM -0700, Dan Williams wrote:
> On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote:
> > > > > Does anyone know why devmap is pte_special anyhow?
> > >
> > > It does not need to be special as mentioned here:
> > >
> > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/
> >
> > I added a remark there
> >
> > Not special means more to me, it means devmap should do the refcounts
> > properly like normal memory pages.
> >
> > It means vm_normal_page should return !NULL and it means insert_page,
> > not insert_pfn should be used to install them in the PTE. VMAs should
> > not be MIXED MAP, but normal struct page maps.
> >
> > I think this change alone would fix all the refcount problems
> > everwhere in DAX and devmap.
> >
> > > The refcount dependencies also go away after this...
> > >
> > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/
> > >
> > > ...but you can see that patches 1 and 2 in that series depend on being
> > > able to guarantee that all mappings are invalidated when the undelying
> > > device that owns the pgmap goes away.
> >
> > If I have put everything together right this is because of what I
> > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and
> > expecting that to work sanely.
> >
> > This means the page map cannot be removed until all the PTEs are fully
> > flushed, which buggily doesn't happen because of the missing unplug.
> >
> > However, this is all because nobody incrd a refcount to represent the
> > reference in the PTE and since this ment that 0 refcount pages were
> > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to
> > unbreak GUP?
> >
> > So.. Is there some reason why devmap pages are trying so hard to avoid
> > sane refcounting???
>
> I wouldn't put it that way. It's more that the original sin of
> ZONE_DEVICE that sought to reuse the lru field space, by never having
> a zero recount, then got layered upon and calcified in malignant ways.
> In the meantime surrounding infrastructure got decrustified. Work like
> the 'struct page' cleanup among other things, made it clearer and
> clearer over time that the original design choice needed to be fixed.

So, there used to be some reason, but with the current code
arrangement it is not the case? This is why it looks so strange when
reading it..

AFIACT we are good on the LRU stuff now? Eg release_pages() does not
use page->lru for is_zone_device_page()?

> The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but
> now that we have page-available DAX, yes, we can skip the FS
> notification and just rely on typical refcounting and hanging until
> the FS has a chance to uninstall the PTEs. You're right, the FS
> notification is an improvement to the conversion, not a requirement.

It all sounds so simple. I looked at this for a good long time and the
first major roadblock is huge pages.

The mm side is designed around THP which puts a single high order page
into the PUD/PMD such that the mm only has to juggle one page. This a
very sane and reasonable thing to do.

DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
THP would make using normal refconting much simpler. I looked at
teaching the mm core to deal with page arrays - it is certainly
doable, but it is quite inefficient and ugly mm code.

So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?

Joao has a series that does this to device-dax:

https://lore.kernel.org/all/20210827145819.16471-1-joao.m.martins@oracle.com/

TTM is kind of broken already but does have a struct page, and it is
probably already a high order one. Maybe it is OK? I will ask Thomas

That leaves FSDAX. Can this be fixed? I know nothing of filesystems or
fsdax to guess. Sounds like folios to me ..

Assuming changing FSDAX is hard.. How would DAX people feel about just
deleting the PUD/PMD support until it can be done with compound pages?

Doing so would allow fixing the lifecycle, cleaning up gup and
basically delete a huge wack of slow DAX and devmap specific code from
the mm. It also opens the door to removing the PTE flag and thus
allowing DAX on more architectures.

> However, there still needs to be something in the gup-fast path to
> indicate that GUP_LONGTERM is not possible because the PTE
> represents

It looks easy now:

1) We know the pfn has a struct page * because it isn't pfn special

2) We can get a valid ref on the struct page *:

      head = try_grab_compound_head(page, 1, flags);

   Holding a ref ensures that head->pgmap is valid.

3) Then check the page type directly:

     if ((flags & FOLL_LONGTERM) && is_zone_device_page(head))

   This tells us we can access the ZONE_DEVICE struct in the union

4) Ask what the pgmap owner wants to do:

    if (head->pgmap->deny_foll_longterm)
          return FAIL

Cost is only paied if FOLL_LONGTERM is given

Thanks,
Jason

  reply	other threads:[~2021-10-16 15:44 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20211014153928.16805-1-alex.sierra@amd.com>
     [not found] ` <20211014153928.16805-3-alex.sierra@amd.com>
     [not found]   ` <20211014170634.GV2744544@nvidia.com>
2021-10-14 18:43     ` [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount Matthew Wilcox
2021-10-14 19:01       ` Dan Williams
2021-10-14 23:06         ` Jason Gunthorpe
2021-10-15  1:37           ` Dan Williams
2021-10-16 15:44             ` Jason Gunthorpe [this message]
2021-10-16 16:39               ` Matthew Wilcox
2021-10-17 18:20                 ` Dan Williams
2021-10-17 18:35               ` Dan Williams
2021-10-18 18:25                 ` Jason Gunthorpe
2021-10-18 19:37                   ` Dan Williams
2021-10-18 23:06                     ` Jason Gunthorpe
2021-10-19 15:13                       ` Joao Martins
2021-10-19 16:01                         ` Jason Gunthorpe
2021-10-19 19:21                           ` Dan Williams
2021-10-20 17:06                             ` Joao Martins
2021-10-20 17:12                               ` Dan Williams
2021-10-20 18:51                                 ` Joao Martins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211016154450.GJ2744544@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=apopple@nvidia.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@lst.de \
    --cc=jglisse@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=nvdimm@lists.linux.dev \
    --cc=rcampbell@nvidia.com \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).