Re: [PATCH 03/11] mm: page_vma_mapped_walk(): use pmd_read_atomic()

From: Jason Gunthorpe <jgg@ziepe.ca>
To: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Yang Shi <shy828301@gmail.com>,
	Wang Yugui <wangyugui@e16-tech.com>,
	Matthew Wilcox <willy@infradead.org>,
	Alistair Popple <apopple@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>, Zi Yan <ziy@nvidia.com>,
	Peter Xu <peterx@redhat.com>, Will Deacon <will@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 03/11] mm: page_vma_mapped_walk(): use pmd_read_atomic()
Date: Fri, 11 Jun 2021 12:36:13 -0300	[thread overview]
Message-ID: <20210611153613.GR1096940@ziepe.ca> (raw)
In-Reply-To: <aebb6b96-153e-7d7-59da-f6bad4337aa7@google.com>

On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > 
> > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > Cc: <stable@vger.kernel.org>
> > > >  mm/page_vma_mapped.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > >  	pud = pud_offset(p4d, pvmw->address);
> > > >  	if (!pud_present(*pud))
> > > >  		return false;
> > > > +
> > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > >  	/*
> > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > >  	 * compiler and used as a stale value after we've observed a
> > > >  	 * subsequent update.
> > > >  	 */
> > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > +	barrier();
> > > > +
> > > 
> > > Hm. It makes me wounder if barrier() has to be part of pmd_read_atomic().
> > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > rest of pmd_read_atomic() users may be broken.
> > > 
> > > Am I wrong?
> > 
> > I agree with you, something called _atomic should not require the
> > caller to provide barriers.
> > 
> > I think the issue is simply that the two implementations of
> > pmd_read_atomic() should use READ_ONCE() internally, no?
> 
> I've had great difficulty coming up with answers for you.
> 
> This patch was based on two notions I've had lodged in my mind
> for several years:
> 
> 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> atomically (even if the pmd_t spans multiple words); but reading again
> after all this time the comment above it, it seems to be more specialized
> than I'd thought (biggest selling point being for when you want to check
> pmd_none(), which we don't).  And has no READ_ONCE() or barrier() inside,
> so really needs that barrier() to be as safe as the previous READ_ONCE().

IMHO it is a simple bug that the 64 bit version of this is not marked
with READ_ONCE() internally. It is reading data without a lock, AFAIK
our modern understanding of the memory model is that requires
READ_ONCE().

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24cb..0bf1fdec928e71 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
 	pmdval_t ret;
-	u32 *tmp = (u32 *)pmdp;
+	u32 *tmp = READ_ONCE((u32 *)pmdp);
 
 	ret = (pmdval_t) (*tmp);
 	if (ret) {
@@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 		 * or we can end up with a partial pmd.
 		 */
 		smp_rmb();
-		ret |= ((pmdval_t)*(tmp + 1)) << 32;
+		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
 	}
 
 	return (pmd_t) { ret };
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 46b13780c2c8c9..c8c7a3307d2773 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 	 * only going to work, if the pmdval_t isn't larger than
 	 * an unsigned long.
 	 */
-	return *pmdp;
+	return READ_ONCE(*pmdp);
 }
 #endif
 

> 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> because READ_ONCE() did not work (did not give the necessary guarantee? or
> did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.

This is really interesting, the git history e37c69827063 ("mm: replace
ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
dropped here "because it doesn't work on non-scalar types" due to a
(now 8 year old) gcc bug.

According to the gcc bug READ_ONCE() on anything that is a scalar
sized struct triggers GCC to ignore the READ_ONCEyness. To work around
this bug then READ_ONCE can never be used on any of the struct
protected page table elements. While I am not 100% sure, it looks like
this is a pre gcc 4.9 bug, and since gcc 4.9 is now the minimum
required compiler this is all just cruft. Use READ_ONCE() here too...

> But I've now come across some changes that Will Deacon made last year:
> the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> "a strong prevailing wind" anyway :)  And 8e958839e4b9 ("sparc32: mm:
> Restructure sparc32 MMU page-table layout") put an end to sparc32's
> typedef struct { unsigned long pmdv[16]; } pmd_t.

Indeed, that is good news

> As to your questions about pmd_read_atomic() usage elsewhere:
> please don't force me to think so hard!  (And you've set me half-
> wondering, whether there are sneaky THP transitions, perhaps of the
> "unstable" kind, that page_vma_mapped_walk() should be paying more
> attention to: but for sanity's sake I won't go there, not now.)

If I recall, (and I didn't recheck this right now) the only thing
pmd_read_atomic() provides is the special property that if the pmd's
flags are observed to point to a pte table then pmd_read_atomic() will
reliably return the pte table pointer.

Otherwise it returns the flags and a garbage pointer because under the
THP protocol a PMD pointing at a page can be converted to a PTE table
if you hold the mmap sem in read mode. Upgrading a PTE table to a PMD
page requires mmap sem in write mode so once a PTE table is observed
'locklessly' the value is stable.. Or at least so says the documentation

pmd_read_atomic() is badly named, tricky to use, and missing the
READ_ONCE() because it is so old..

As far as is page_vma_mapped_walk correct.. Everything except
is_pmd_migration_entry() is fine to my eye, and I simply don't know
the rules aroudn migration entries to know right/wrong.

I suspect it probably is required to manipulate a migration entry
while holding the mmap_sem in write mode??

There is some list of changes to the page table that require holding
the mmap sem in write mode that I've never seen documented for..

Jason