Date: Tue, 7 Feb 2017 17:19:56 +0300
From: "Kirill A. Shutemov"
To: Zi Yan, Andrea Arcangeli, Minchan Kim
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, kirill.shutemov@linux.intel.com, akpm@linux-foundation.org, minchan@kernel.org, vbabka@suse.cz, mgorman@techsingularity.net, n-horiguchi@ah.jp.nec.com, khandual@linux.vnet.ibm.com, zi.yan@cs.rutgers.edu, Zi Yan
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
Message-ID: <20170207141956.GA4789@node.shutemov.name>
References: <20170205161252.85004-1-zi.yan@sent.com> <20170205161252.85004-4-zi.yan@sent.com>
In-Reply-To: <20170205161252.85004-4-zi.yan@sent.com>

On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
> From: Zi Yan
>
> Originally, zap_pmd_range() checks the pmd value without taking the pmd
> lock. This can cause a pmd_protnone entry to not be freed.
>
> This is because changing a pmd entry into a pmd_protnone entry takes two
> steps: first, the pmd entry is cleared to a pmd_none entry; then, the
> pmd_none entry is changed into a pmd_protnone entry.
> The racy check, even with a barrier, might only see the pmd_none entry
> in zap_pmd_range(); thus, the mapping is neither split nor zapped.
Okay, this can only happen with MADV_DONTNEED, as we hold
down_write(mmap_sem) for the rest of zap_pmd_range() and whoever modifies
page tables has to hold at least down_read(mmap_sem) or exclude parallel
modification in other ways. See 1a5a9906d4e8 ("mm: thp: fix pmd_bad()
triggering in code paths holding mmap_sem read mode") for more details.

+Andrea.

> Later, in free_pmd_range(), pmd_none_or_clear_bad() will see the
> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
> since the pmd_protnone entry is not properly freed, the corresponding
> deposited pte page table is not freed either.

free_pmd_range() should be fine: we only free page tables after the vmas
are gone (under down_write(mmap_sem) in exit_mmap() and unmap_region())
or after the page tables have been moved (under down_write(mmap_sem) in
shift_arg_pages()).

> This causes a memory leak, or a kernel crash if VM_BUG_ON() is enabled.

The problem is that NUMA balancing calls change_huge_pmd() under
down_read(mmap_sem), not down_write(mmap_sem) as the rest of the users
do. That makes NUMA balancing the only code path beyond the page fault
that can turn pmd_none() into pmd_trans_huge() under down_read(mmap_sem).

This can lead to a race where MADV_DONTNEED misses the THP. That's not
critical for the page fault vs. MADV_DONTNEED race, as we end up with a
clear page in that case. Not so for change_huge_pmd().

Looks like we need pmdp_modify() or something like it to modify the
protection bits in place, without clearing the pmd.

Not sure how to get a crash scenario. BTW, Zi, have you observed the
crash? Or is this based on code inspection? Any backtraces?

Ouch! madvise_free_huge_pmd() is broken too. We shouldn't clear the pmd
in the middle of it, as we only hold down_read(mmap_sem). I guess we
need a helper that clears both the access and dirty bits.

Minchan, could you look into it?

-- 
Kirill A. Shutemov