linux-mm.kvack.org archive mirror
From: Dan Williams <dan.j.williams@intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Joao Martins <joao.m.martins@oracle.com>,
	 Gerald Schaefer <gerald.schaefer@linux.ibm.com>,
	Christoph Hellwig <hch@lst.de>,
	 Heiko Carstens <hca@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	 Christian Borntraeger <borntraeger@de.ibm.com>,
	Linux NVDIMM <nvdimm@lists.linux.dev>,
	 linux-s390 <linux-s390@vger.kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	 Alex Sierra <alex.sierra@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	 Linux MM <linux-mm@kvack.org>,
	Ralph Campbell <rcampbell@nvidia.com>,
	 Alistair Popple <apopple@nvidia.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	 Dave Jiang <dave.jiang@intel.com>
Subject: Re: can we finally kill off CONFIG_FS_DAX_LIMITED
Date: Tue, 19 Oct 2021 10:38:42 -0700
Message-ID: <CAPcyv4jAQVSKB7rts5Mfu0JRtB-b1NGFgu03+8-ja8o11d1vQA@mail.gmail.com>
In-Reply-To: <20211019142032.GT2744544@nvidia.com>

On Tue, Oct 19, 2021 at 7:25 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
> > On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
> > >
> > > > dev_pagemap_mapping_shift() does a lookup to figure out
> > > > which order the page table entry represents. is_zone_device_page()
> > > > is already used to gate usage of dev_pagemap_mapping_shift(). I think
> > > > this might be an artifact of the same issue as 3) in which PMDs/PUDs
> > > > are represented with base pages and hence you can't do what the rest
> > > > of the world does with:
> > >
> > > This code looks broken as written.
> > >
> > > vma_address() relies on certain properties that maybe DAX (maybe
> > > even only FSDAX?) sets on its ZONE_DEVICE pages, and
> > > dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
> > > will crash if a memory failure hits any other kind of ZONE_DEVICE
> > > area.
> >
> > That case is gated with a TODO in memory_failure_dev_pagemap(). I
> > never got any response to queries about what to do about memory
> > failure vs HMM.
>
> Unfortunately neither Logan nor Felix noticed that TODO conditional
> when adding new types..
>
> But maybe it is dead code anyhow as it already has this:
>
>         cookie = dax_lock_page(page);
>         if (!cookie)
>                 goto out;
>
> Right before? Doesn't that already always fail for anything that isn't
> a DAX?

Yes, I originally made that ordering mistake in:

6100e34b2526 mm, memory_failure: Teach memory_failure() about dev_pagemap pages

...however, if we complete the move away from page-less DAX it also
allows for the locking to move from the xarray to lock_page(). I.e.
dax_lock_page() is pinning the inode after the fact, but I suspect the
inode should have been pinned when the mapping was established. Which
raises a question for the reflink support whether it is pinning all
involved inodes while the mapping is established?

>
> > > I'm not sure the comment is correct anyhow:
> > >
> > >                 /*
> > >                  * Unmap the largest mapping to avoid breaking up
> > >                  * device-dax mappings which are constant size. The
> > >                  * actual size of the mapping being torn down is
> > >                  * communicated in siginfo, see kill_proc()
> > >                  */
> > >                 unmap_mapping_range(page->mapping, start, size, 0);
> > >
> > > Because for non PageAnon unmap_mapping_range() does either
> > > zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
> > >
> > > Despite its name __split_huge_pmd() does not actually split, it will
> > > call __split_huge_pmd_locked:
> > >
> > >         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
> > >                 goto out;
> > >         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
> > >
> > > Which does
> > >         if (!vma_is_anonymous(vma)) {
> > >                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > >
> > > Which is a zap, not split.
> > >
> > > So I wonder if there is a reason to use anything other than 4k here
> > > for DAX?
> > >
> > > >       tk->size_shift = page_shift(compound_head(p));
> > > >
> > > > ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
> > >
> > > And what would be so wrong with memory failure doing this as a 4k
> > > page?
> >
> > device-dax does not support misaligned mappings. It makes hard
> > guarantees for applications that can not afford the page table
> > allocation overhead of sub-1GB mappings.
>
> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
>
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.

As Joao points out, userspace wants to know the blast radius of the
unmap for historical reasons. I do think it's worth deprecating that
somehow... providing a better error management interface is part of
the DAX-reflink enabling.
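
For reference, a minimal userspace sketch of how an application might
recover that blast radius today, assuming Linux with glibc, where
siginfo_t carries si_addr_lsb for BUS_MCEERR_AR / BUS_MCEERR_AO. The
names mce_range, mce_range_from() and install_sigbus_handler() are made
up for illustration and are not taken from any existing application:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(uintptr_t) >= sizeof(void *),
	      "uintptr_t must be able to hold si_addr");

/* Hypothetical helper type: the address range reported as lost. */
struct mce_range {
	uintptr_t start;
	size_t len;
};

/*
 * si_addr_lsb is the log2 of the size of the affected region (12 for a
 * 4k page, 21 for 2M, 30 for 1G). Align the faulting address down to
 * that size. An out-of-range shift yields an empty range.
 */
static struct mce_range mce_range_from(uintptr_t addr, unsigned int lsb)
{
	struct mce_range r = { 0, 0 };

	if (lsb >= sizeof(uintptr_t) * 8)
		return r;
	r.len = (size_t)1 << lsb;
	r.start = addr & ~((uintptr_t)r.len - 1);
	return r;
}

/* Last reported range, stashed for the main program to inspect. */
static volatile uintptr_t last_start;
static volatile size_t last_len;

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;

	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		struct mce_range r = mce_range_from((uintptr_t)si->si_addr,
						    (unsigned int)si->si_addr_lsb);
		last_start = r.start;
		last_len = r.len;
	}
	/*
	 * For BUS_MCEERR_AR a real application would need to jump away
	 * from, or stop using, the poisoned range rather than just return.
	 */
}

/* Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With a 1GB device-dax mapping the handler would see an lsb of 30 and
treat the whole aligned 1GB region as gone, which is the dependency on
mapping size that a better error management interface would ideally
replace.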

