From: "Zi Yan" <zi.yan@sent.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: mgorman@techsingularity.net, riel@redhat.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kirill.shutemov@linux.intel.com, akpm@linux-foundation.org,
	minchan@kernel.org, vbabka@suse.cz, n-horiguchi@ah.jp.nec.com,
	khandual@linux.vnet.ibm.com, "Zi Yan" <ziy@nvidia.com>
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
Date: Mon, 06 Feb 2017 10:32:10 -0600
Message-ID: <1D482D89-0504-4E98-9931-B160BAEB3D75@sent.com>
In-Reply-To: <20170206160751.GA29962@node.shutemov.name>

On 6 Feb 2017, at 10:07, Kirill A. Shutemov wrote:

> On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Originally, zap_pmd_range() checks the pmd value without taking the
>> pmd lock. This can leave a pmd_protnone entry unfreed.
>>
>> Changing a pmd entry into a pmd_protnone entry takes two steps: first,
>> the pmd entry is cleared to a pmd_none entry; then, the pmd_none entry
>> is changed into a pmd_protnone entry. The racy check, even with a
>> barrier, might only see the intermediate pmd_none entry in
>> zap_pmd_range(), so the mapping is neither split nor zapped.
>
> That's definitely a good catch.
>
> But I don't agree with the solution. Taking the pmd lock on each
> zap_pmd_range() is a significant hit to the scalability of the code
> path. Yes, the split ptl lock helps, but it would be nice to avoid the
> lock in the first place.
>
> Can we fix change_huge_pmd() instead? Is there a reason why we cannot
> set up pmd_protnone() atomically?

If you want to set up pmd_protnone() atomically, we need a new way of
changing pmds, like pmdp_huge_cmp_exchange_and_clear(). Otherwise, given
the racy check of the pmd in zap_pmd_range(), it is impossible to
eliminate the chance of hitting this bug as long as pmd_protnone() is
set up in two steps: first clear the entry, then set it.
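
To make the window concrete, here is a minimal sketch of the race as I
see it (simplified pseudo-kernel code, not the exact functions used by
change_huge_pmd()):

	/* CPU A: two-step protnone setup, change_huge_pmd()-style */
	pmd_t entry;
	entry = pmdp_huge_get_and_clear(mm, addr, pmd); /* step 1: *pmd is pmd_none */
	entry = pmd_modify(entry, PAGE_NONE);           /* build the protnone pmd */
	set_pmd_at(mm, addr, pmd, entry);               /* step 2: *pmd is protnone */

	/* CPU B: racy check in zap_pmd_range(), no pmd lock held */
	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))   /* false: sees step-1 pmd_none */
		... ;
	if (pmd_none_or_trans_huge_or_clear_bad(pmd))   /* true: sees pmd_none */
		goto next;      /* the protnone entry set in step 2 is never zapped */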

However, if we use pmdp_huge_cmp_exchange_and_clear() to change pmds
from now on, instead of the current two-step approach, it will rule out
the batched TLB shootdown optimization (introduced by Mel Gorman for
base page swapping) once THP becomes swappable in the future. Maybe
other optimizations, too?
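
To be clear about what that primitive would be: nothing like
pmdp_huge_cmp_exchange_and_clear() exists today, so the following is
only a hypothetical, x86-64-flavored sketch of a one-step update:

	/* hypothetical helper: install the new pmd in one atomic step,
	 * leaving no intermediate pmd_none window for the racy check to hit */
	static inline pmd_t pmdp_huge_cmpxchg(struct mm_struct *mm,
			unsigned long addr, pmd_t *pmdp, pmd_t old, pmd_t new)
	{
		return __pmd(cmpxchg(&pmdp->pmd, pmd_val(old), pmd_val(new)));
	}

The update itself is atomic; my worry above is that losing the explicit
clear-and-flush step would rule out deferring the TLB flush.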

Why do you think holding the pmd lock is bad? In zap_pte_range(), the
pte lock is also held while each PTE is zapped.
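
The locking pattern there looks roughly like this (simplified from
zap_pte_range(), details elided):

	pte_t *start_pte, *pte;
	spinlock_t *ptl;

	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); /* takes pte lock */
	pte = start_pte;
	do {
		pte_t ptent = *pte;
		... ;   /* each pte is inspected and zapped under ptl */
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(start_pte, ptl);       /* dropped only after the loop */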

BTW, I am following Naoya's suggestion and am going to take the pmd lock
inside the loop, so the lock is held while each pmd entry is checked and
released once that entry has been zapped, split, or found to point to a
page table. Does that still hurt performance much?
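
That is, something like the following (an untested sketch of what v4
will do, not the final patch):

	do {
		next = pmd_addr_end(addr, end);
		ptl = pmd_lock(vma->vm_mm, pmd);        /* lock one pmd entry */
		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
			... ;   /* split or zap the huge pmd under ptl */
		}
		if (pmd_none_or_clear_bad(pmd)) {
			spin_unlock(ptl);
			continue;
		}
		spin_unlock(ptl);       /* drop before descending to pte level */
		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
		cond_resched();
	} while (pmd++, addr = next, addr != end);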

Thanks.



>
> Mel? Rik?
>
>>
>> Later, in free_pmd_range(), pmd_none_or_clear_bad() will see the
>> pmd_protnone entry, treat it as a pmd_bad entry, and clear it.
>> Furthermore, since the pmd_protnone entry is never properly freed, the
>> corresponding deposited pte page table is not freed either.
>>
>> This causes a memory leak, or a kernel crash if VM_BUG_ON() is enabled.
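
(For reference, pmd_none_or_clear_bad() is roughly the following,
simplified from asm-generic:

	static inline int pmd_none_or_clear_bad(pmd_t *pmd)
	{
		if (pmd_none(*pmd))
			return 1;
		if (unlikely(pmd_bad(*pmd))) {
			pmd_clear_bad(pmd);     /* reports the bad pmd, then clears it */
			return 1;
		}
		return 0;
	}

A huge pmd_protnone entry is not a valid page-table pointer, so
pmd_bad() is true for it: the entry is reported and cleared, and the
deposited pte page table hanging off it is leaked.)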
>>
>> This patch relies on __split_huge_pmd_locked() and
>> __zap_huge_pmd_locked().
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>>  mm/memory.c | 24 +++++++++++-------------
>>  1 file changed, 11 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3929b015faf7..7cfdd5208ef5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>>  				struct zap_details *details)
>>  {
>>  	pmd_t *pmd;
>> +	spinlock_t *ptl;
>>  	unsigned long next;
>>
>>  	pmd = pmd_offset(pud, addr);
>> +	ptl = pmd_lock(vma->vm_mm, pmd);
>>  	do {
>>  		next = pmd_addr_end(addr, end);
>>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>>  			if (next - addr != HPAGE_PMD_SIZE) {
>>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
>> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
>> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
>> -				goto next;
>> +				__split_huge_pmd_locked(vma, pmd, addr, false);
>> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
>> +				continue;
>>  			/* fall through */
>>  		}
>> -		/*
>> -		 * Here there can be other concurrent MADV_DONTNEED or
>> -		 * trans huge page faults running, and if the pmd is
>> -		 * none or trans huge it can change under us. This is
>> -		 * because MADV_DONTNEED holds the mmap_sem in read
>> -		 * mode.
>> -		 */
>> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>> -			goto next;
>> +
>> +		if (pmd_none_or_clear_bad(pmd))
>> +			continue;
>> +		spin_unlock(ptl);
>>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
>> -next:
>>  		cond_resched();
>> +		spin_lock(ptl);
>>  	} while (pmd++, addr = next, addr != end);
>> +	spin_unlock(ptl);
>>
>>  	return addr;
>>  }
>> -- 
>> 2.11.0
>>
>
> -- 
>  Kirill A. Shutemov


--
Best Regards
Yan Zi

Thread overview: 47+ messages
2017-02-05 16:12 [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
2017-02-05 16:12 ` [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible Zi Yan
2017-02-06  6:12   ` Naoya Horiguchi
2017-02-06 12:10     ` Zi Yan
2017-02-06 15:02   ` Matthew Wilcox
2017-02-06 15:03     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 02/14] mm: thp: create new __zap_huge_pmd_locked function Zi Yan
2017-02-05 16:12 ` [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range() Zi Yan
2017-02-06  4:02   ` Hillf Danton
2017-02-06  4:14     ` Zi Yan
2017-02-06  7:43   ` Naoya Horiguchi
2017-02-06 13:02     ` Zi Yan
2017-02-06 23:22       ` Naoya Horiguchi
2017-02-06 16:07   ` Kirill A. Shutemov
2017-02-06 16:32     ` Zi Yan [this message]
2017-02-06 17:35       ` Kirill A. Shutemov
2017-02-07 13:55     ` Aneesh Kumar K.V
2017-02-07 14:19   ` Kirill A. Shutemov
2017-02-07 15:11     ` Zi Yan
2017-02-07 16:37       ` Kirill A. Shutemov
2017-02-07 17:14         ` Zi Yan
2017-02-07 17:45           ` Kirill A. Shutemov
2017-02-13  0:25             ` Zi Yan
2017-02-13 10:59               ` Kirill A. Shutemov
2017-02-13 14:40                 ` Andrea Arcangeli
2017-02-05 16:12 ` [PATCH v3 04/14] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 Zi Yan
2017-02-09  9:14   ` Naoya Horiguchi
2017-02-09 15:07     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 05/14] mm: mempolicy: add queue_pages_node_check() Zi Yan
2017-02-05 16:12 ` [PATCH v3 06/14] mm: thp: introduce separate TTU flag for thp freezing Zi Yan
2017-02-05 16:12 ` [PATCH v3 07/14] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Zi Yan
2017-02-05 16:12 ` [PATCH v3 08/14] mm: thp: enable thp migration in generic path Zi Yan
2017-02-09  9:15   ` Naoya Horiguchi
2017-02-09 15:17     ` Zi Yan
2017-02-09 23:04       ` Naoya Horiguchi
2017-02-14 20:13   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 09/14] mm: thp: check pmd migration entry in common path Zi Yan
2017-02-09  9:16   ` Naoya Horiguchi
2017-02-09 17:36     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 10/14] mm: soft-dirty: keep soft-dirty bits over thp migration Zi Yan
2017-02-05 16:12 ` [PATCH v3 11/14] mm: hwpoison: soft offline supports " Zi Yan
2017-02-05 16:12 ` [PATCH v3 12/14] mm: mempolicy: mbind and migrate_pages support " Zi Yan
2017-02-05 16:12 ` [PATCH v3 13/14] mm: migrate: move_pages() supports " Zi Yan
2017-02-09  9:16   ` Naoya Horiguchi
2017-02-09 17:37     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 14/14] mm: memory_hotplug: memory hotremove " Zi Yan
2017-02-23 16:12 ` [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
