Re: can we finally kill off CONFIG_FS_DAX_LIMITED

From: Felix Kuehling <felix.kuehling@amd.com>
To: Jason Gunthorpe <jgg@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>,
	Gerald Schaefer <gerald.schaefer@linux.ibm.com>,
	Christoph Hellwig <hch@lst.de>,
	Heiko Carstens <hca@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Linux NVDIMM <nvdimm@lists.linux.dev>,
	linux-s390 <linux-s390@vger.kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Alex Sierra <alex.sierra@amd.com>, Linux MM <linux-mm@kvack.org>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Alistair Popple <apopple@nvidia.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	"Phillips, Daniel" <Daniel.Phillips@amd.com>
Subject: Re: can we finally kill off CONFIG_FS_DAX_LIMITED
Date: Tue, 19 Oct 2021 11:38:36 -0400
Message-ID: <d5a7e72d-b366-4fb4-8c41-100e5d8ce020@amd.com>
In-Reply-To: <20211019142032.GT2744544@nvidia.com>

On 2021-10-19 at 10:20 a.m., Jason Gunthorpe wrote:
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
>> On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
>>>
>>>> dev_pagemap_mapping_shift() does a lookup to figure out
>>>> which order the page table entry represents. is_zone_device_page()
>>>> is already used to gate usage of dev_pagemap_mapping_shift(). I think
>>>> this might be an artifact of the same issue as 3) in which PMDs/PUDs
>>>> are represented with base pages and hence you can't do what the rest
>>>> of the world does with:
>>> This code looks broken as written.
>>>
>>> vma_address() relies on certain properties that maybe DAX (maybe
>>> even only FSDAX?) sets on its ZONE_DEVICE pages, and
>>> dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
>>> will crash if a memory failure hits any other kind of ZONE_DEVICE
>>> area.
>> That case is gated with a TODO in memory_failure_dev_pagemap(). I
>> never got any response to queries about what to do about memory
>> failure vs HMM.
> Unfortunately neither Logan nor Felix noticed that TODO conditional
> when adding new types..

You mean this?

        if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
                /*
                 * TODO: Handle HMM pages which may need coordination
                 * with device-side memory.
                 */
                goto unlock;
        }

Yeah, I never looked at that. Alex, we'll need to add || pgmap->type ==
MEMORY_DEVICE_COHERENT here. Or should we change this into a test that
looks for the pgmap->types that are actually handled by
memory_failure_dev_pagemap? E.g.

        if (pgmap->type != MEMORY_DEVICE_FS_DAX)
                goto unlock;
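
To make the difference between the two checks concrete, here is a minimal
userspace sketch, not kernel code. The enum is a local stand-in for the
kernel's enum memory_type with illustrative values, and the helper names
(handled_by_denylist, handled_by_allowlist, policy_examples) are
hypothetical. It only models which types would fall through to the
handling code under each style of test.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the kernel's enum memory_type; values are illustrative. */
enum memory_type {
	MEMORY_DEVICE_PRIVATE = 1,
	MEMORY_DEVICE_FS_DAX,
	MEMORY_DEVICE_COHERENT,	/* the new type discussed in this thread */
};

/* Deny-list style: bail out only for the types someone remembered to name. */
static bool handled_by_denylist(enum memory_type type)
{
	if (type == MEMORY_DEVICE_PRIVATE)
		return false;
	return true;
}

/* Allow-list style: bail out for everything except the type known to work. */
static bool handled_by_allowlist(enum memory_type type)
{
	return type == MEMORY_DEVICE_FS_DAX;
}

/* Hypothetical usage: a newly added type slips past the deny-list only. */
static void policy_examples(void)
{
	assert(handled_by_denylist(MEMORY_DEVICE_COHERENT));
	assert(!handled_by_allowlist(MEMORY_DEVICE_COHERENT));
	assert(handled_by_allowlist(MEMORY_DEVICE_FS_DAX));
}
```

With the deny-list, every new pgmap type has to be added to the condition
by hand. With the allow-list, new types are skipped by default.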

I think in case of a real HW error, our driver should be calling
memory_failure. But then a callback from here back into the driver
wouldn't make sense.

For MADV_HWPOISON we may need a callback to the driver, if we want the
driver to treat it like an actual HW error and retire the page.

>
> But maybe it is dead code anyhow as it already has this:
>
> 	cookie = dax_lock_page(page);
> 	if (!cookie)
> 		goto out;
>
> Right before? Doesn't that already always fail for anything that isn't
> a DAX?

I guess the check for the pgmap->type should come before this.
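
Roughly what I have in mind for the ordering, as a userspace mock rather
than a patch. Everything prefixed mock_ is hypothetical, including the
assumption that the lock returns 0 for anything that is not FS DAX; the
counter only exists to show that the lock is never attempted for other
types once the type check comes first.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's enum memory_type; values are illustrative. */
enum memory_type {
	MEMORY_DEVICE_PRIVATE = 1,
	MEMORY_DEVICE_FS_DAX,
};

struct mock_pgmap {
	enum memory_type type;
};

static int lock_attempts;	/* counts calls to the mock lock below */

/* Hypothetical stand-in for dax_lock_page(): assumed to fail for non-DAX. */
static unsigned long mock_dax_lock_page(const struct mock_pgmap *pgmap)
{
	lock_attempts++;
	return pgmap->type == MEMORY_DEVICE_FS_DAX ? 1 : 0;
}

/* Returns 0 if the failure was handled, -1 if it was skipped. */
static int mock_memory_failure(const struct mock_pgmap *pgmap)
{
	unsigned long cookie;

	assert(pgmap != NULL);

	/* Type check first, so no DAX state is touched for other types. */
	if (pgmap->type != MEMORY_DEVICE_FS_DAX)
		return -1;

	cookie = mock_dax_lock_page(pgmap);
	if (!cookie)
		return -1;

	/* ... unmapping and signalling would go here ... */
	return 0;
}
```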

Regards,
  Felix

>
>>> I'm not sure the comment is correct anyhow:
>>>
>>>                 /*
>>>                  * Unmap the largest mapping to avoid breaking up
>>>                  * device-dax mappings which are constant size. The
>>>                  * actual size of the mapping being torn down is
>>>                  * communicated in siginfo, see kill_proc()
>>>                  */
>>>                 unmap_mapping_range(page->mapping, start, size, 0);
>>>
>>> Because for non PageAnon unmap_mapping_range() does either
>>> zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
>>>
>>> Despite its name __split_huge_pmd() does not actually split, it will
>>> call __split_huge_pmd_locked:
>>>
>>>         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
>>>                 goto out;
>>>         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>>>
>>> Which does
>>>         if (!vma_is_anonymous(vma)) {
>>>                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>>>
>>> Which is a zap, not split.
>>>
>>> So I wonder if there is a reason to use anything other than 4k here
>>> for DAX?
>>>
>>>>       tk->size_shift = page_shift(compound_head(p));
>>>>
>>>> ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
>>> And what would be so wrong with memory failure doing this as a 4k
>>> page?
>> device-dax does not support misaligned mappings. It makes hard
>> guarantees for applications that can not afford the page table
>> allocation overhead of sub-1GB mappings.
> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
>
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.
>
> Jason