Re: [RFC] Make the memory failure blast radius more precise

From: "HORIGUCHI NAOYA(堀口　直也)" <naoya.horiguchi@nec.com>
To: David Rientjes <rientjes@google.com>
Cc: "Luck, Tony" <tony.luck@intel.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Peter Xu <peterx@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Borislav Petkov <bp@alien8.de>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jane Chu <jane.chu@oracle.com>
Subject: Re: [RFC] Make the memory failure blast radius more precise
Date: Thu, 25 Jun 2020 02:16:42 +0000
Message-ID: <20200625021641.GA21811@hori.linux.bs1.fc.nec.co.jp>
In-Reply-To: <alpine.DEB.2.22.394.2006232114100.97817@chino.kir.corp.google.com>

On Tue, Jun 23, 2020 at 09:32:41PM -0700, David Rientjes wrote:
> On Tue, 23 Jun 2020, Luck, Tony wrote:
> 
> > > Hardware actually tells us the blast radius of the error, but we ignore
> > > it and take out the entire page.  We've had a customer request to know
> > > exactly how much of the page is damaged so they can avoid reconstructing
> > > an entire 2MB page if only a single cacheline is damaged.
> > > 
> > > This is only a strawman that I did in an hour or two; I'd appreciate
> > > architectural-level feedback.  Should I just convert memory_failure() to
> > > always take an address & granularity?  Should I create a struct to pass
> > > around (page, phys, granularity) instead of reconstructing the missing
> > > pieces in half a dozen functions?  Is this functionality welcome at all,
> > > or is the risk of upsetting applications which expect at least a page
> > > of granularity too high?
> > 
> > What is the interface to these applications that want finer granularity?
> > 
> > Current code does very poorly with hugetlbfs pages ... user loses the
> > whole 2 MB or 1GB. That's just silly (though I've been told that it is
> > hard to fix because allowing a hugetlbfs page to be broken up at an arbitrary
> > time as the result of a machine check means that the kernel needs locking
> > around a bunch of fast paths that currently assume that a huge page will
> > stay being a huge page).
> > 
> 
> Thanks for bringing this up, Tony.  Mike Kravetz pointed me to this thread 
> (thanks Mike!) so let's add him in explicitly as well as Andrea, Peter, 
> and David from Red Hat who we've been discussing an idea with that may 
> introduce exactly this needed support but for different purposes :)  The 
> timing of this thread is _uncanny_.
> 
> To improve the performance of userfaultfd for the purposes of post-copy 
> live migration we need to reduce the granularity in which pages are 
> migrated; we're looking at this from a 1GB gigantic page perspective but 
> the same arguments can likely be had for 2MB hugepages as well.  1GB pages 
> are too much of a bottleneck and, as you bring up, 1GB is simply too much 
> memory to poison :)  We don't have 1GB thp support so the big idea was to 
> introduce thp-like DoubleMap support into hugetlbfs for the purposes of 
> post-copy live migration and then I had the idea that this could be 
> extended to memory failure as well.
> 
> (We don't see the lack of 1GB thp here as a deficiency for anything other 
> than these two issues, hugetlb provides strong guarantees.)
> 
> I don't want to hijack Matthew's thread which is primarily about DAX, but 
> did get intrigued by your concerns about hugetlbfs page poisoning.  We can 
> fork the thread off here to discuss only the hugetlb application of this 
> if it makes sense to you or you'd like to collaborate on it as well.
> 
> The DoubleMap support would allow us to map the 1GB gigantic pages with 
> the PUD and the PMDs as well (and, further, the 2MB hugepages with the PMD 
> and PTEs) so that we can copy fragments into PMDs or PTEs and we don't 
> need to migrate the entire gigantic page.  Any access triggers #PF through 
> hugetlb_no_page() -> handle_userfault() which would trigger another 
> UFFDIO_COPY and map another fragment.
>
> Assume a world where this DoubleMap support already exists for hugetlb 
> pages today and all the invariants including page migration are fixed up 
> (since a PTE can now map a hugetlb page and a PMD can now map a gigantic 
> hugetlb page).  It *seems* like we'd be able to reduce the blast radius 
> here too on a hard memory failure: dissolve the gigantic page in place, 
> SIGBUS/SIGKILL on the bad PMD or PTE, and avoid poisoning the head of the 
> hugetlb page.  We agree that poisoning this large amount of memory is not 
> ideal :)
> 
> Anyway, this was some brainstorming that I was doing with Mike and the 
> others based on the idea of using DoubleMap support for post-copy live 
> migration.  If you would be interested or would like to collaborate on 
> it, we'd love to talk.

Thanks for proposing this.  I think DoubleMap support could be a good general
solution (not only for the post-copy live migration use case).  Splitting a
pud/pmd entry into pmd/pte entries has a smaller impact than migrating all the
healthy data somewhere else.  The implementation could be challenging, but not
as challenging as thp splitting, because we don't have to consider collapsing.
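
To put rough numbers on that, here is a minimal sketch of the arithmetic.  It
is not kernel code: region_start() and healthy_bytes_kept() are hypothetical
helpers made up for illustration, and the shift values assume x86-64 page
sizes (4KB pte, 2MB pmd, 1GB pud).

```c
#include <stdint.h>

/* Assumed x86-64 mapping sizes, expressed as log2(bytes). */
#define PTE_SHIFT 12 /* 4KB base page */
#define PMD_SHIFT 21 /* 2MB huge page */
#define PUD_SHIFT 30 /* 1GB gigantic page */

/*
 * Hypothetical helper: start of the naturally aligned 2^shift-byte
 * region that contains the failing physical address.
 */
uint64_t region_start(uint64_t addr, unsigned int shift)
{
	return addr & ~((UINT64_C(1) << shift) - 1);
}

/*
 * Hypothetical helper: bytes of healthy data that stay mapped when a
 * 2^mapped_shift-byte page is split and only the 2^poison_shift-byte
 * region containing the error is taken out.
 */
uint64_t healthy_bytes_kept(unsigned int mapped_shift,
			    unsigned int poison_shift)
{
	return (UINT64_C(1) << mapped_shift) - (UINT64_C(1) << poison_shift);
}
```

With these assumptions, healthy_bytes_kept(PUD_SHIFT, PTE_SHIFT) is 1GB minus
4KB, while poisoning at the pud level (PUD_SHIFT, PUD_SHIFT) keeps nothing.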

DAX mappings seem to have a similar issue.  If a dax file could be mapped by
both pmd and pte mappings, and a pmd mapping could be converted into pte
mappings, we could contain errors at a smaller granularity for pmem.

Thanks,
Naoya Horiguchi