linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: "Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.)"
	<longpeng2@huawei.com>
Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	linux-kernel@vger.kernel.org, arei.gonglei@huawei.com,
	weidong.huang@huawei.com, weifuqiang@huawei.com,
	kvm@vger.kernel.org, linux-mm@kvack.org,
	Matthew Wilcox <willy@infradead.org>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH v2] mm/hugetlb: fix a addressing exception caused by huge_pte_offset()
Date: Sun, 22 Mar 2020 19:54:32 -0700	[thread overview]
Message-ID: <e16fe81b-5c4c-e689-2f48-214f2025df2f@oracle.com> (raw)
In-Reply-To: <5700f44e-9df9-1b12-bc29-68e0463c2860@huawei.com>

On 3/22/20 7:03 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> 
> 
> On 2020/3/22 7:38, Mike Kravetz wrote:
>> On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
>>> From: Longpeng <longpeng2@huawei.com>
>>>
>>> Our machine encountered a panic(addressing exception) after run
>>> for a long time and the calltrace is:
>>> RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
>>> RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
>>> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
>>> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
>>> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
>>> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
>>> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
>>> FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>>  [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
>>>  [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
>>>  [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
>>>  [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
>>>  [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
>>>  [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
>>>  [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
>>>  [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
>>>  [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
>>>  [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
>>>  [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
>>>  [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
>>>  [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
>>>  [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
>>>  [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
>>>  [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
>>>  [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
>>>  [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
>>>  [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
>>>  [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
>>>  [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
>>>  [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
>>>  [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
>>>  [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
>>>  [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
>>>
>>> ( The kernel we used is older, but we think the latest kernel also has this
>>>   bug after dig into this problem. )
>>>
>>> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
>>> may return a wrong 'pmdp' if there is a race. Please look at the following
>>> code snippet:
>>>     ...
>>>     pud = pud_offset(p4d, addr);
>>>     if (sz != PUD_SIZE && pud_none(*pud))
>>>         return NULL;
>>>     /* hugepage or swap? */
>>>     if (pud_huge(*pud) || !pud_present(*pud))
>>>         return (pte_t *)pud;
>>>
>>>     pmd = pmd_offset(pud, addr);
>>>     if (sz != PMD_SIZE && pmd_none(*pmd))
>>>         return NULL;
>>>     /* hugepage or swap? */
>>>     if (pmd_huge(*pmd) || !pmd_present(*pmd))
>>>         return (pte_t *)pmd;
>>>     ...
>>>
>>> The following sequence would trigger this bug:
>>> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
>>> 1. CPU0: "pud_huge(*pud)" is false
>>> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
>>> 3. CPU0: "!pud_present(*pud)" is false, continue
>>> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
>>> However, we want CPU0 to return NULL or pudp.
>>>
>>> We can avoid this race by read the pud only once. What's more, we also use
>>> READ_ONCE to access the entries for safe(e.g. avoid the compilier mischief)
>>>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Longpeng <longpeng2@huawei.com>
>>
>> Andrew dropped this patch from his tree which caused me to go back and
>> look at the status of this patch/issue.
>>
>> It is pretty obvious that code in the current huge_pte_offset routine
>> is racy.  I checked out the assembly code produced by my compiler and
>> verified that the line,
>>
>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>
>> does actually dereference *pud twice.  So, the value could change between
>> those two dereferences.   Longpeng (Mike) could easlily recreate the issue
>> if he put a delay between the two dereferences.  I believe the only
>> reservations/concerns about the patch below was the use of READ_ONCE().
>> Is that correct?
>>
> Hi Mike,
> 
> It seems I've missed your another mail in my client, I found it here
> (https://lkml.org/lkml/2020/2/27/1927) just now.
> 
> I think we have reached an agreement that the pud/pmd need READ_ONCE in
> huge_pte_offset() and disagreement is whether the pgd/p4d also need READ_ONCE,
> right ?

Correct.

Sorry, I did not reply to the mail thread with more context.

>> Are there any objections to the patch if the READ_ONCE() calls are removed?
>>
> Because the pgd/p4g are only accessed and dereferenced once here, so some guys
> want to remove it.
> 
> But we must make sure they are *really* accessed once, in other words, this
> makes we need to care about both the implementation of pgd_present/p4d_present
> and the behavior of any compiler, for example:
> 
> '''
> static inline int func(int val)
> {
>     return subfunc1(val) & subfunc2(val);
> }
> 
> func(*p); // int *p
> '''
> We must make sure there's no strange compiler to generate an assemble code that
> access and dereference 'p' more than once.
> 
> I've not found any backwards with READ_ONCE here. However, if you also agree to
> remove READ_ONCE around pgd/p4d, I'll do.
> 

I would like to remove the READ_ONCE calls and move the patch forward.  It
does address a real issue you are seeing.

To be honest, I am more worried about the races in lookup_address_in_pgd()
than using or not using READ_ONCE for  pgd/p4d in this patch.

I have not looked closely at the generated code for lookup_address_in_pgd.
It appears that it would dereference p4d, pud and pmd multiple times.  Sean
seemed to think there was something about the calling context that would
make issues like those seen with huge_pte_offset less likely to happen.  I
do not know if this is accurate or not.

Let's remove the two READ_ONCE calls and move this patch forward.  We can
look closer at lookup_address_in_pgd and generate another patch if that needs
to be fixed as well.

Thanks
-- 
Mike Kravetz


  reply	other threads:[~2020-03-23  2:55 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-22  3:33 [PATCH v2] mm/hugetlb: fix a addressing exception caused by huge_pte_offset() Longpeng(Mike)
2020-03-21 23:38 ` Mike Kravetz
2020-03-23  2:03   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2020-03-23  2:54     ` Mike Kravetz [this message]
2020-03-23  3:43       ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2020-03-23 14:40       ` Sean Christopherson
2020-03-23 16:44         ` Jason Gunthorpe
2020-03-23 16:09   ` Jason Gunthorpe
2020-03-23 17:27     ` Mike Kravetz
2020-03-23 18:07       ` Jason Gunthorpe
2020-03-23 20:35         ` Mike Kravetz
2020-03-23 22:52           ` Jason Gunthorpe
2020-03-24  2:37             ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2020-03-24 11:55               ` Jason Gunthorpe
2020-03-24 15:25                 ` Mike Kravetz
2020-03-24 15:55                   ` Jason Gunthorpe
2020-03-24 16:19                     ` Mike Kravetz
2020-03-24 17:59                       ` Jason Gunthorpe
2020-03-24 19:47                         ` Mike Kravetz
2020-02-22  5:23 Qian Cai
2020-02-22  6:33 ` Longpeng (Mike)
2020-02-22 11:50   ` Qian Cai
2020-02-22 17:02   ` Matthew Wilcox
2020-02-23  1:24     ` Longpeng (Mike)
2020-02-27 21:41       ` Mike Kravetz
2020-03-21 22:46         ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e16fe81b-5c4c-e689-2f48-214f2025df2f@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=arei.gonglei@huawei.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longpeng2@huawei.com \
    --cc=sean.j.christopherson@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=weidong.huang@huawei.com \
    --cc=weifuqiang@huawei.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).