Date: Tue, 7 Feb 2017 17:19:56 +0300
From: "Kirill A. Shutemov"
To: Zi Yan, Andrea Arcangeli, Minchan Kim
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, kirill.shutemov@linux.intel.com, akpm@linux-foundation.org, minchan@kernel.org, vbabka@suse.cz, mgorman@techsingularity.net, n-horiguchi@ah.jp.nec.com, khandual@linux.vnet.ibm.com, zi.yan@cs.rutgers.edu, Zi Yan
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
Message-ID: <20170207141956.GA4789@node.shutemov.name>
References: <20170205161252.85004-1-zi.yan@sent.com> <20170205161252.85004-4-zi.yan@sent.com>
In-Reply-To: <20170205161252.85004-4-zi.yan@sent.com>

On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
> From: Zi Yan
>
> Originally, zap_pmd_range() checks the pmd value without taking the pmd
> lock. This can cause a pmd_protnone entry to not be freed.
>
> This is because changing a pmd entry into a pmd_protnone entry takes two
> steps: first, the pmd entry is cleared to a pmd_none entry; then, the
> pmd_none entry is changed into a pmd_protnone entry.
> The racy check, even with a barrier, might only see the pmd_none entry
> in zap_pmd_range(); thus, the mapping is neither split nor zapped.
Okay, this can only happen with MADV_DONTNEED, as we hold
down_write(mmap_sem) for the rest of zap_pmd_range() and whoever modifies
page tables has to hold at least down_read(mmap_sem) or exclude parallel
modification in other ways. See 1a5a9906d4e8 ("mm: thp: fix pmd_bad()
triggering in code paths holding mmap_sem read mode") for more details.

+Andrea.

> Later, in free_pmd_range(), pmd_none_or_clear_bad() will see the
> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
> since the pmd_protnone entry is not properly freed, the corresponding
> deposited pte page table is not freed either.

free_pmd_range() should be fine: we only free page tables after the vmas
are gone (under down_write(mmap_sem) in exit_mmap() and unmap_region())
or after the page tables have been moved (under down_write(mmap_sem) in
shift_arg_pages()).

> This causes a memory leak, or a kernel crash if VM_BUG_ON() is enabled.

The problem is that NUMA balancing calls change_huge_pmd() under
down_read(mmap_sem), not down_write(mmap_sem) as the rest of the users
do. That makes NUMA balancing the only code path beyond the page fault
that can turn pmd_none() into pmd_trans_huge() under down_read(mmap_sem).

This can lead to a race where MADV_DONTNEED misses the THP. That's not
critical for the page fault vs. MADV_DONTNEED race, as we end up with a
clear page in that case. Not so for change_huge_pmd().

Looks like we need pmdp_modify() or something like it to modify the
protection bits in place, without clearing the pmd.

Not sure how to get a crash scenario. BTW, Zi, have you observed the
crash? Or is this based on code inspection? Any backtraces?

Ouch! madvise_free_huge_pmd() is broken too. We shouldn't clear the pmd
in the middle of it, as we only hold down_read(mmap_sem). I guess we
need a helper that clears both the access and dirty bits.

Minchan, could you look into it?

-- 
Kirill A. Shutemov