Re: [PATCH v2] mm/hugetlb: fix a addressing exception caused by huge_pte_offset()

From: Jason Gunthorpe <jgg@ziepe.ca>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Longpeng (Mike)" <longpeng2@huawei.com>,
	akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	linux-kernel@vger.kernel.org, arei.gonglei@huawei.com,
	weidong.huang@huawei.com, weifuqiang@huawei.com,
	kvm@vger.kernel.org, linux-mm@kvack.org,
	Matthew Wilcox <willy@infradead.org>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH v2] mm/hugetlb: fix a addressing exception caused by huge_pte_offset()
Date: Mon, 23 Mar 2020 19:52:25 -0300	[thread overview]
Message-ID: <20200323225225.GF20941@ziepe.ca> (raw)
In-Reply-To: <88698dd7-eb87-4b0b-7ba7-44ef6eab6a6c@oracle.com>

On Mon, Mar 23, 2020 at 01:35:07PM -0700, Mike Kravetz wrote:
> On 3/23/20 11:07 AM, Jason Gunthorpe wrote:
> > On Mon, Mar 23, 2020 at 10:27:48AM -0700, Mike Kravetz wrote:
> > 
> >>>  	pgd = pgd_offset(mm, addr);
> >>> -	if (!pgd_present(*pgd))
> >>> +	if (!pgd_present(READ_ONCE(*pgd)))
> >>>  		return NULL;
> >>>  	p4d = p4d_offset(pgd, addr);
> >>> -	if (!p4d_present(*p4d))
> >>> +	if (!p4d_present(READ_ONCE(*p4d)))
> >>>  		return NULL;
> >>>  
> >>>       pud = pud_offset(p4d, addr);
> >>
> >> One would argue that pgd and p4d can not change from present to !present
> >> during the execution of this code.  To me, that seems like the issue which
> >> would cause an issue.  Of course, I could be missing something.
> > 
> > This I am not sure of, I think it must be true under the read side of
> > the mmap_sem, but probably not guarenteed under RCU..
> > 
> > In any case, it doesn't matter, the fact that *p4d can change at all
> > is problematic. Unwinding the above inlines we get:
> > 
> >   p4d = p4d_offset(pgd, addr)
> >   if (!p4d_present(*p4d))
> >       return NULL;
> >   pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> > 
> > According to our memory model the compiler/CPU is free to execute this
> > as:
> > 
> >   p4d = p4d_offset(pgd, addr)
> >   p4d_for_vaddr = *p4d;
> >   if (!p4d_present(*p4d))
> >       return NULL;
> >   pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);
> > 
> 
> Wow!  How do you know this?  You don't need to answer :)

It says explicitly in Documentation/memory-barriers.txt - see
section COMPILER BARRIER:

 (*) The compiler is within its rights to reorder loads and stores
     to the same variable, and in some cases, the CPU is within its
     rights to reorder loads to the same variable.  This means that
     the following code:

        a[0] = x;
        a[1] = x;

     Might result in an older value of x stored in a[1] than in a[0].

It also says READ_ONCE puts things in program order, but we don't use
READ_ONCE inside pud_offset(), so it doesn't help us.

Best answer is to code things so there is exactly one dereference of
the pointer protected by READ_ONCE. Very clear to read, very safe.

Maybe Longpeng can rework the patch around these principles?

Also I wonder if the READ_ONCE(*pmdp) is OK. gup_pmd_range() uses it,
but I can't explain why it shouldn't be pmd_read_atomic().

> > In the case where p4 goes from !present -> present (ie
> > handle_mm_fault()):
> > 
> > p4d_for_vaddr == p4d_none, and p4d_present(*p4d) == true, meaning the
> > p4d_page_vaddr() will crash.
> > 
> > Basically the problem here is not just missing READ_ONCE, but that the
> > p4d is read multiple times at all. It should be written like gup_fast
> > does, to guarantee a single CPU read of the unstable data:
> > 
> >   p4d = READ_ONCE(*p4d_offset(pgdp, addr));
> >   if (!p4d_present(p4))
> >       return NULL;
> >   pud = pud_offset(&p4d, addr);
> > 
> > At least this is what I've been able to figure out :\
> 
> In that case, I believe there are a bunch of similar routines with this issue.

Yes, my look around page walk related users makes me come to a similar
worry.

Fortunately, I think this is largely theoretical as most likely the
compiler will generate a single store for these coding patterns. 

That said, there have been bugs in the past, see commit 26c191788f18
("mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race
condition") which is significantly related to the compiler lifting a
load inside pte_offset to before the required 'if (pmd_*)' checks.

> For this patch, I was primarily interested in seeing the obvious
> multiple dereferences in C fixed up.  This is above and beyond that!
> :)

Well, I think it is worth solving the underlying problem
properly. Otherwise we get weird solutions to data races like
pmd_trans_unstable()...

Jason