All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Li Wang <liwang@redhat.com>, Linux-MM <linux-mm@kvack.org>,
	LTP List <ltp@lists.linux.it>,
	"xishi.qiuxishi@alibaba-inc.com" <xishi.qiuxishi@alibaba-inc.com>,
	Cyril Hrubis <chrubis@suse.cz>
Subject: Re: [MM Bug?] mmap() triggers SIGBUS while doing the​ ​numa_move_pages() for offlined hugepage in background
Date: Mon, 5 Aug 2019 10:36:55 -0700	[thread overview]
Message-ID: <7d78f6b9-afb8-79d1-003e-56de58fded00@oracle.com> (raw)
In-Reply-To: <20190805085740.GC7597@dhcp22.suse.cz>

On 8/5/19 1:57 AM, Michal Hocko wrote:
> On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
>> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
>>> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>>>> There appears to be a race with hugetlb_fault and try_to_unmap_one of
>>>> the migration path.
>>>>
>>>> Can you try this patch in your environment?  I am not sure if it will
>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>  
>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>  		if (IS_ERR(page)) {
>>>> +			/*
>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>> +			 * which is modifying page table with lock.  However,
>>>> +			 * we are not holding lock here.  Before returning
>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>> +			 * sure there really is no entry.
>>>> +			 */
>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>> +				ret = 0;
>>>> +				spin_unlock(ptl);
>>>> +				goto out;
>>>> +			}
>>>> +			spin_unlock(ptl);
>>>
>>> Thanks you for investigation, Mike.
>>> I tried this change and found no SIGBUS, so it works well.
>>>
>>> I'm still not clear about how !huge_pte_none() becomes true here,
>>> because we enter hugetlb_no_page() only when huge_pte_none() is non-null
>>> and (racy) try_to_unmap_one() from page migration should convert the
>>> huge_pte into a migration entry, not null.
>>
>> Thanks for taking a look Naoya.
>>
>> In try_to_unmap_one(), there is this code block:
>>
>> 		/* Nuke the page table entry. */
>> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>> 		if (should_defer_flush(mm, flags)) {
>> 			/*
>> 			 * We clear the PTE but do not flush so potentially
>> 			 * a remote CPU could still be writing to the page.
>> 			 * If the entry was previously clean then the
>> 			 * architecture must guarantee that a clear->dirty
>> 			 * transition on a cached TLB entry is written through
>> 			 * and traps if the PTE is unmapped.
>> 			 */
>> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>>
>> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
>> 		} else {
>> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
>> 		}
>>
>> That happens before setting the migration entry.  Therefore, for a period
>> of time the pte is NULL (huge_pte_none() returns true).
>>
>> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
>> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
>> returns true, it calls hugetlb_no_page which is where we try to allocate
>> a page and fails.
>>
>> Does that make sense, or am I missing something?
>>
>> The patch checks for this specific condition: someone changing the pte
>> from NULL to non-NULL while holding the lock.  I am not sure if this is
>> the best way to fix.  But, it may be the easiest.
> 
> Please add a comment to explain this because this is quite subtle and
> tricky. Unlike the regular page fault hugetlb_no_page is protected by a
> large lock so a retry check seems unexpected.

Will do.

Fixing up hugetlbfs locking is still 'on my list'.  There are known issues.
The last RFC/attempt was this:
http://lkml.kernel.org/r/20190201221705.15622-1-mike.kravetz@oracle.com
I believe that patch would have handled this issue.

However, as mentioned above it may better to just patch this issue exposed
by LTP and work on the more comprehensive change in the background.
-- 
Mike Kravetz


WARNING: multiple messages have this Message-ID (diff)
From: Mike Kravetz <mike.kravetz@oracle.com>
To: ltp@lists.linux.it
Subject: [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the​ ​numa_move_pages() for offlined hugepage in background
Date: Mon, 5 Aug 2019 10:36:55 -0700	[thread overview]
Message-ID: <7d78f6b9-afb8-79d1-003e-56de58fded00@oracle.com> (raw)
In-Reply-To: <20190805085740.GC7597@dhcp22.suse.cz>

On 8/5/19 1:57 AM, Michal Hocko wrote:
> On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
>> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
>>> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>>>> There appears to be a race with hugetlb_fault and try_to_unmap_one of
>>>> the migration path.
>>>>
>>>> Can you try this patch in your environment?  I am not sure if it will
>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>  
>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>  		if (IS_ERR(page)) {
>>>> +			/*
>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>> +			 * which is modifying page table with lock.  However,
>>>> +			 * we are not holding lock here.  Before returning
>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>> +			 * sure there really is no entry.
>>>> +			 */
>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>> +				ret = 0;
>>>> +				spin_unlock(ptl);
>>>> +				goto out;
>>>> +			}
>>>> +			spin_unlock(ptl);
>>>
>>> Thanks you for investigation, Mike.
>>> I tried this change and found no SIGBUS, so it works well.
>>>
>>> I'm still not clear about how !huge_pte_none() becomes true here,
>>> because we enter hugetlb_no_page() only when huge_pte_none() is non-null
>>> and (racy) try_to_unmap_one() from page migration should convert the
>>> huge_pte into a migration entry, not null.
>>
>> Thanks for taking a look Naoya.
>>
>> In try_to_unmap_one(), there is this code block:
>>
>> 		/* Nuke the page table entry. */
>> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>> 		if (should_defer_flush(mm, flags)) {
>> 			/*
>> 			 * We clear the PTE but do not flush so potentially
>> 			 * a remote CPU could still be writing to the page.
>> 			 * If the entry was previously clean then the
>> 			 * architecture must guarantee that a clear->dirty
>> 			 * transition on a cached TLB entry is written through
>> 			 * and traps if the PTE is unmapped.
>> 			 */
>> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>>
>> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
>> 		} else {
>> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
>> 		}
>>
>> That happens before setting the migration entry.  Therefore, for a period
>> of time the pte is NULL (huge_pte_none() returns true).
>>
>> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
>> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
>> returns true, it calls hugetlb_no_page which is where we try to allocate
>> a page and fails.
>>
>> Does that make sense, or am I missing something?
>>
>> The patch checks for this specific condition: someone changing the pte
>> from NULL to non-NULL while holding the lock.  I am not sure if this is
>> the best way to fix.  But, it may be the easiest.
> 
> Please add a comment to explain this because this is quite subtle and
> tricky. Unlike the regular page fault hugetlb_no_page is protected by a
> large lock so a retry check seems unexpected.

Will do.

Fixing up hugetlbfs locking is still 'on my list'.  There are known issues.
The last RFC/attempt was this:
http://lkml.kernel.org/r/20190201221705.15622-1-mike.kravetz@oracle.com
I believe that patch would have handled this issue.

However, as mentioned above it may better to just patch this issue exposed
by LTP and work on the more comprehensive change in the background.
-- 
Mike Kravetz

  reply	other threads:[~2019-08-05 17:37 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-29  5:17 [MM Bug?] mmap() triggers SIGBUS while doing the​ ​numa_move_pages() for offlined hugepage in background Li Wang
2019-07-29  5:17 ` [LTP] " Li Wang
2019-07-29 19:00 ` Mike Kravetz
2019-07-29 19:00   ` [LTP] " Mike Kravetz
2019-07-30  6:29   ` Li Wang
2019-07-30  6:29     ` [LTP] " Li Wang
2019-07-31  0:44     ` Mike Kravetz
2019-07-31  0:44       ` [LTP] " Mike Kravetz
2019-08-02  0:19       ` Mike Kravetz
2019-08-02  0:19         ` [LTP] " Mike Kravetz
2019-08-02  4:15         ` Naoya Horiguchi
2019-08-02  4:15           ` [LTP] " Naoya Horiguchi
2019-08-02 17:42           ` Mike Kravetz
2019-08-02 17:42             ` [LTP] " Mike Kravetz
2019-08-05  0:40             ` Naoya Horiguchi
2019-08-05  0:40               ` [LTP] " Naoya Horiguchi
2019-08-05  8:57             ` Michal Hocko
2019-08-05  8:57               ` [LTP] " Michal Hocko
2019-08-05 17:36               ` Mike Kravetz [this message]
2019-08-05 17:36                 ` Mike Kravetz
2019-08-07  0:07                 ` Mike Kravetz
2019-08-07  0:07                   ` [LTP] " Mike Kravetz
2019-08-07  7:39                   ` Michal Hocko
2019-08-07  7:39                     ` [LTP] " Michal Hocko
2019-08-07 15:10                     ` Mike Kravetz
2019-08-07 15:10                       ` [LTP] " Mike Kravetz
2019-08-02  9:59         ` Li Wang
2019-08-02  9:59           ` [LTP] " Li Wang
2019-07-30  6:38   ` Li Wang
2019-07-30  6:38     ` [LTP] " Li Wang
2019-08-02  3:48 ` Naoya Horiguchi
2019-08-02  3:48   ` [LTP] " Naoya Horiguchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7d78f6b9-afb8-79d1-003e-56de58fded00@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=chrubis@suse.cz \
    --cc=linux-mm@kvack.org \
    --cc=liwang@redhat.com \
    --cc=ltp@lists.linux.it \
    --cc=mhocko@suse.com \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=xishi.qiuxishi@alibaba-inc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.