Date: Tue, 16 May 2017 22:29:19 +0200
From: Andrea Arcangeli <aarcange@redhat.com>
To: Vlastimil Babka
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Andrew Morton,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
Message-ID: <20170516202919.GA2843@redhat.com>
References: <20170302151034.27829-1-kirill.shutemov@linux.intel.com>
 <20170302151034.27829-3-kirill.shutemov@linux.intel.com>

On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
> On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> > In case prot_numa, we are under down_read(mmap_sem). It's critical
> > to not clear pmd intermittently to avoid race with MADV_DONTNEED
> > which is also under down_read(mmap_sem):
> >
> > 	CPU0:				CPU1:
> > 				change_huge_pmd(prot_numa=1)
> > 				 pmdp_huge_get_and_clear_notify()
> > madvise_dontneed()
> >  zap_pmd_range()
> >   pmd_trans_huge(*pmd) == 0 (without ptl)
> >   // skip the pmd
> > 				 set_pmd_at();
> > 				 // pmd is re-established
> >
> > The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> > which may break userspace.
> >
> > Found by code analysis, never saw triggered.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e7ce73b2b208..bb2b3646bd78 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  	if (prot_numa && pmd_protnone(*pmd))
> >  		goto unlock;
> >
> > -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> > +	/*
> > +	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
> > +	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> > +	 * which is also under down_read(mmap_sem):
> > +	 *
> > +	 *	CPU0:				CPU1:
> > +	 *				change_huge_pmd(prot_numa=1)
> > +	 *				 pmdp_huge_get_and_clear_notify()
> > +	 * madvise_dontneed()
> > +	 *  zap_pmd_range()
> > +	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
> > +	 *   // skip the pmd
> > +	 *				 set_pmd_at();
> > +	 *				 // pmd is re-established
> > +	 *
> > +	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> > +	 * which may break userspace.
> > +	 *
> > +	 * pmdp_invalidate() is required to make sure we don't miss
> > +	 * dirty/young flags set by hardware.
> > +	 */
> > +	entry = *pmd;
> > +	pmdp_invalidate(vma, addr, pmd);
> > +
> > +	/*
> > +	 * Recover dirty/young flags. It relies on pmdp_invalidate to not
> > +	 * corrupt them.
> > +	 */
>
> pmdp_invalidate() does:
>
>	pmd_t entry = *pmdp;
>	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
>
> so it's not atomic and if CPU sets dirty or accessed in the middle of
> this, they will be lost?

I agree it looks like the dirty bit can be lost. Furthermore this also
loses an MMU notifier invalidate, which will lead to corruption at the
secondary MMU level (the secondary MMU will keep using the old
protection permissions, potentially continuing to write to a
wrprotected page).

> But I don't see how the other invalidate caller
> __split_huge_pmd_locked() deals with this either. Andrea, any idea?

The original code I wrote did this in __split_huge_page_map to create
the "entry" to establish in the pte pagetables:

		entry = mk_pte(page + i, vma->vm_page_prot);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);

For anonymous memory the dirty bit is only meaningful for swapping,
and THP couldn't be swapped, so setting it unconditionally avoided any
issue with the pmdp_invalidate; pmdp_establish sequence.

pmdp_invalidate is needed primarily to avoid aliasing of two different
TLB translations pointing from the same virtual address to the same
physical address, which triggered machine checks (while needing to
keep the pmd huge at all times; back then the pmd was also marked
splitting. Splitting is a software bit, so userland could still access
the data: it only blocked kernel code from manipulating the pmd,
similar to what the migration entry does right now upstream, except
the migration entry prevents userland from accessing the page during
the split, which is less efficient than the splitting bit, but at
least it's only used for the physical split. Back then there was no
difference between virtual and physical split, and the physical split
is less frequent than the virtual one right now).

It looks like this needs a pmdp_populate that atomically grabs the
value of the pmd and returns it, like pmdp_huge_get_and_clear_notify
does, and a _notify variant to use when "freeze" is false (if freeze
is true the MMU notifier invalidate must have happened when the pmd
was set to a migration entry).

If pmdp_populate_notify (freeze==true) / pmdp_populate (freeze==false)
would return the old pmd value atomically with xchg() (just instead of
setting it to 0 we should set it to the mknotpresent one), then we can
set the dirty bit on the ptes (__split_huge_pmd_locked) or in the pmd
itself in change_huge_pmd accordingly. Obtaining the "dirty" flag from
the pmd read before calling pmdp_invalidate is not ok (losing the
_notify is also not ok).

Thanks!
Andrea
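P.S. A minimal sketch of the idea above. The helper names
(pmdp_establish_xchg, change_prot_numa_pmd) are illustrative only, not
an existing interface: the point is that the pmd is replaced with its
not-present copy in a single xchg, so any dirty/young bits the hardware
sets before the exchange end up in the returned value rather than being
lost, and the _notify variant would additionally issue the MMU notifier
invalidate before re-establishing the pmd.

	/*
	 * Illustrative sketch only: atomically replace the pmd with a
	 * not-present copy and return the previous value, then flush
	 * the TLB for the huge page range.
	 */
	static inline pmd_t pmdp_establish_xchg(struct vm_area_struct *vma,
						unsigned long address,
						pmd_t *pmdp, pmd_t new)
	{
		pmd_t old = __pmd(xchg(&pmd_val(*pmdp), pmd_val(new)));

		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return old;
	}

	/* change_huge_pmd() could then do something along these lines: */
	static void change_prot_numa_pmd(struct vm_area_struct *vma,
					 struct mm_struct *mm,
					 unsigned long addr, pmd_t *pmd,
					 pgprot_t newprot)
	{
		pmd_t entry;

		/* never clears the pmd, so MADV_DONTNEED can't miss it */
		entry = pmdp_establish_xchg(vma, addr, pmd,
					    pmd_mknotpresent(*pmd));
		/* dirty/young set by hardware up to the xchg are in entry */
		entry = pmd_modify(entry, newprot);
		set_pmd_at(mm, addr, pmd, entry);
	}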