From: Vlastimil Babka
To: "Kirill A. Shutemov", Andrea Arcangeli, Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andy Lutomirski
Subject: Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
Date: Tue, 16 May 2017 16:54:01 +0200
Message-ID: <90009469-f0af-1b82-868d-ce1adc6540cf@suse.cz>
References: <20170302151034.27829-1-kirill.shutemov@linux.intel.com>
 <20170302151034.27829-3-kirill.shutemov@linux.intel.com>

On 04/12/2017 03:33 PM, Vlastimil Babka wrote:
> On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
>> In case prot_numa, we are under down_read(mmap_sem). It's critical
>> to not clear pmd intermittently to avoid race with MADV_DONTNEED
>> which is also under down_read(mmap_sem):
>>
>> 	CPU0:				CPU1:
>> 				change_huge_pmd(prot_numa=1)
>> 				 pmdp_huge_get_and_clear_notify()
>> madvise_dontneed()
>>  zap_pmd_range()
>>   pmd_trans_huge(*pmd) == 0 (without ptl)
>>   // skip the pmd
>> 				 set_pmd_at();
>> 				 // pmd is re-established
>>
>> The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>> which may break userspace.
>>
>> Found by code analysis, never saw triggered.
>>
>> Signed-off-by: Kirill A. Shutemov
>> ---
>>  mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e7ce73b2b208..bb2b3646bd78 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  	if (prot_numa && pmd_protnone(*pmd))
>>  		goto unlock;
>>  
>> -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
>> +	/*
>> +	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
>> +	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
>> +	 * which is also under down_read(mmap_sem):
>> +	 *
>> +	 *	CPU0:				CPU1:
>> +	 *				change_huge_pmd(prot_numa=1)
>> +	 *				 pmdp_huge_get_and_clear_notify()
>> +	 * madvise_dontneed()
>> +	 *  zap_pmd_range()
>> +	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
>> +	 *   // skip the pmd
>> +	 *				 set_pmd_at();
>> +	 *				 // pmd is re-established
>> +	 *
>> +	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>> +	 * which may break userspace.
>> +	 *
>> +	 * pmdp_invalidate() is required to make sure we don't miss
>> +	 * dirty/young flags set by hardware.
>> +	 */
>> +	entry = *pmd;
>> +	pmdp_invalidate(vma, addr, pmd);
>> +
>> +	/*
>> +	 * Recover dirty/young flags. It relies on pmdp_invalidate to not
>> +	 * corrupt them.
>> +	 */
> 
> pmdp_invalidate() does:
> 
>         pmd_t entry = *pmdp;
>         set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> 
> so it's not atomic, and if the CPU sets dirty or accessed in the middle
> of this, they will be lost?
> 
> But I don't see how the other invalidate caller,
> __split_huge_pmd_locked(), deals with this either. Andrea, any idea?

Looks like we never resolved this, and meanwhile the patch is in
mainline as ced108037c2aa.

CC Andy, who deals with TLB a lot these days.
> Vlastimil
> 
>> +	if (pmd_dirty(*pmd))
>> +		entry = pmd_mkdirty(entry);
>> +	if (pmd_young(*pmd))
>> +		entry = pmd_mkyoung(entry);
>> +
>>  	entry = pmd_modify(entry, newprot);
>>  	if (preserve_write)
>>  		entry = pmd_mk_savedwrite(entry);
>>